[Gluster-devel] [Gluster-users] heal hanging
Pranith Kumar Karampuri
pkarampu at redhat.com
Mon Jan 25 02:27:02 UTC 2016
It seems like there is a lot of finodelk/inodelk traffic. I wonder why
that is. I think the next step is to collect a statedump of the brick
that is taking a lot of CPU, using "gluster volume statedump <volname>".
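
For reference, a minimal sketch of how that can be captured (the dump
location below is an assumption based on the default server.statedump-path):

    gluster volume statedump homegfs
    ls -lt /var/run/gluster/    # dumps are typically named <brick-path>.<pid>.dump.<timestamp>

Taking it two or three times while the CPU is spiking would let us compare
the lock tables over time.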
Pranith
On 01/22/2016 08:36 AM, Glomski, Patrick wrote:
> Pranith, attached are stack traces collected every second for 20
> seconds from the high-%cpu glusterfsd process.
>
> Patrick
>
> On Thu, Jan 21, 2016 at 9:46 PM, Glomski, Patrick
> <patrick.glomski at corvidtec.com <mailto:patrick.glomski at corvidtec.com>>
> wrote:
>
> Last entry for get_real_filename on any of the bricks was when we
> turned off the samba gfapi vfs plugin earlier today:
>
> /var/log/glusterfs/bricks/data-brick01a-homegfs.log:[2016-01-21
> 15:13:00.008239] E [server-rpc-fops.c:768:server_getxattr_cbk]
> 0-homegfs-server: 105: GETXATTR /wks_backup
> (40e582d6-b0c7-4099-ba88-9168a3c32ca6)
> (glusterfs.get_real_filename:desktop.ini) ==> (Permission denied)
>
> We'll get back to you with those traces when %cpu spikes again. As
> with most sporadic problems, as soon as you want something out of
> it, the issue becomes harder to reproduce.
>
>
> On Thu, Jan 21, 2016 at 9:21 PM, Pranith Kumar Karampuri
> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>> wrote:
>
>
>
> On 01/22/2016 07:25 AM, Glomski, Patrick wrote:
>> Unfortunately, all samba mounts to the gluster volume through
>> the gfapi vfs plugin have been disabled for the last 6 hours
>> or so, and the frequency of the %cpu spikes has increased. We had
>> switched to sharing a fuse mount through samba, but I just
>> disabled that as well. There are no samba shares of this
>> volume now. The spikes now happen every thirty minutes or so.
>> We've resorted to just rebooting the machine with high load
>> for the present.
>
> Could you check whether logs of the following type have stopped appearing altogether?
> [2016-01-21 15:13:00.005736] E
> [server-rpc-fops.c:768:server_getxattr_cbk] 0-homegfs-server:
> 110: GETXATTR /wks_backup (40e582d6-b0c7-4099-ba88-9168a3c
> 32ca6) (glusterfs.get_real_filename:desktop.ini) ==>
> (Permission denied)
>
> These are the operations that failed. The operations that succeed are
> the ones that actually scan the directory, and I don't have a way to
> find those other than with tcpdump.
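>
> (A quick way to confirm on each brick host is something along these
> lines, using the brick log path from your earlier mail:
>
> grep -c 'glusterfs.get_real_filename' /var/log/glusterfs/bricks/data-brick01a-homegfs.log
> tail -f /var/log/glusterfs/bricks/data-brick01a-homegfs.log | grep get_real_filename
>
> If the count stops growing and the tail stays quiet, those calls have
> stopped arriving.)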
>
> At the moment I have 2 theories:
> 1) these get_real_filename calls
> 2) [2016-01-21 16:10:38.017828] E
> [server-helpers.c:46:gid_resolve] 0-gid-cache: getpwuid_r(494)
> failed
> "
>
> Yessir they are. Normally, sssd would look to the local cache
> file in /var/lib/sss/db/ first, to get any group or userid
> information, then go out to the domain controller. I put the
> options that we are using on our GFS volumes below… Thanks
> for your help.
>
> We had been running sssd with sssd_nss and sssd_be
> sub-processes on these systems for a long time, under the GFS
> 3.5.2 code, and not run into the problem that David described
> with the high cpu usage on sssd_nss.
>
> "
> That was Tom Young's email from 1.5 years back, when we debugged
> it. But the process which was consuming a lot of cpu then was
> sssd_nss, so I am not sure if it is the same issue. Let us debug
> to confirm that '1)' isn't happening. The gstack traces I asked for
> should also help.
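>
> (If it helps, the traces can be gathered with something as simple as
> the following, run against the busy brick process; the pgrep pattern
> is only an example:
>
> pid=$(pgrep -f 'glusterfsd.*homegfs' | head -1)
> for i in $(seq 1 20); do gstack "$pid" > gstack.$i.txt; sleep 1; done
>
> gstack ships with the gdb package on RHEL/CentOS.)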
>
>
> Pranith
>>
>> On Thu, Jan 21, 2016 at 8:49 PM, Pranith Kumar Karampuri
>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>> wrote:
>>
>>
>>
>> On 01/22/2016 07:13 AM, Glomski, Patrick wrote:
>>> We use the samba glusterfs virtual filesystem (the
>>> current version provided on download.gluster.org
>>> <http://download.gluster.org>), but no windows clients
>>> connect directly.
>>
>> Hmm.. Is there a way to disable this and check whether the
>> CPU% still increases? What a getxattr of
>> "glusterfs.get_real_filename <filename>" does is scan the
>> entire directory, calling strcasecmp(<filename>,
>> <scanned-filename>) on each entry. If anything matches, it
>> returns the <scanned-filename>. But the problem is that the
>> scan is costly. So I wonder if this is the reason for the
>> CPU spikes.
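>>
>> (Roughly speaking the cost is one strcasecmp per directory entry, so
>> directories with very many entries make every such lookup expensive.
>> A rough way to spot the biggest directories on a brick, paths being
>> examples only:
>>
>> find /data/brick01a/homegfs -xdev -type d | while IFS= read -r d; do
>>     printf '%s %s\n' "$(ls -A "$d" | wc -l)" "$d"
>> done | sort -rn | head
>>
>> The first column is the entry count, the second the directory.)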
>>
>> Pranith
>>
>>>
>>> On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri
>>> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>> wrote:
>>>
>>> Do you have any windows clients? I see a lot of
>>> getxattr calls for "glusterfs.get_real_filename"
>>> which lead to full readdirs of the directories on
>>> the brick.
>>>
>>> Pranith
>>>
>>> On 01/22/2016 12:51 AM, Glomski, Patrick wrote:
>>>> Pranith, could this kind of behavior be
>>>> self-inflicted by us deleting files directly from
>>>> the bricks? We have done that in the past to clean
>>>> up issues where gluster wouldn't allow us to
>>>> delete from the mount.
>>>>
>>>> If so, is it feasible to clean them up by running a
>>>> search on the .glusterfs directories directly and
>>>> removing non-zero-size files with a reference count
>>>> of 1 (or by directly checking the xattrs to be sure
>>>> a file isn't a DHT link)? For example:
>>>>
>>>> find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links -2 -exec rm -f "{}" \;
>>>>
>>>> Is there anything I'm inherently missing with that
>>>> approach that will further corrupt the system?
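>>>>
>>>> (In case it is safer, here is a dry-run sketch of the same search
>>>> that only lists candidates and also skips anything carrying the
>>>> DHT linkto xattr:
>>>>
>>>> find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links -2 \
>>>>     -exec sh -c 'getfattr -n trusted.glusterfs.dht.linkto "$1" >/dev/null 2>&1 || echo "$1"' _ {} \;
>>>>
>>>> Anything listed is non-empty, has a single hard link, and is not a
>>>> DHT link file.)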
>>>>
>>>>
>>>> On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick
>>>> <patrick.glomski at corvidtec.com
>>>> <mailto:patrick.glomski at corvidtec.com>> wrote:
>>>>
>>>> Load spiked again: ~1200%cpu on gfs02a for
>>>> glusterfsd. Crawl has been running on one of
>>>> the bricks on gfs02b for 25 min or so and users
>>>> cannot access the volume.
>>>>
>>>> I re-listed the xattrop directories as well as
>>>> a 'top' entry and heal statistics. Then I
>>>> restarted the gluster services on gfs02a.
>>>>
>>>> =================== top ===================
>>>>   PID USER PR NI  VIRT  RES  SHR S   %CPU %MEM     TIME+ COMMAND
>>>>  8969 root 20  0 2815m 204m 3588 S 1181.0  0.6 591:06.93 glusterfsd
>>>>
>>>> =================== xattrop ===================
>>>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
>>>> xattrop-41f19453-91e4-437c-afa9-3b25614de210
>>>> xattrop-9b815879-2f4d-402b-867c-a6d65087788c
>>>>
>>>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
>>>> xattrop-70131855-3cfb-49af-abce-9d23f57fb393
>>>> xattrop-dfb77848-a39d-4417-a725-9beca75d78c6
>>>>
>>>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
>>>> e6e47ed9-309b-42a7-8c44-28c29b9a20f8
>>>> xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125
>>>> xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934
>>>> xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0
>>>>
>>>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
>>>> xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc
>>>> xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413
>>>>
>>>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
>>>> xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531
>>>>
>>>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
>>>> xattrop-7e20fdb1-5224-4b9a-be06-568708526d70
>>>>
>>>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
>>>> 8034bc06-92cd-4fa5-8aaf-09039e79d2c8
>>>> c9ce22ed-6d8b-471b-a111-b39e57f0b512
>>>> 94fa1d60-45ad-4341-b69c-315936b51e8d
>>>> xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7
>>>>
>>>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
>>>> xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d
>>>>
>>>>
>>>> =================== heal stats ===================
>>>>
>>>> homegfs [b0-gfsib01a] : Starting time of crawl       : Thu Jan 21 12:36:45 2016
>>>> homegfs [b0-gfsib01a] : Ending time of crawl         : Thu Jan 21 12:36:45 2016
>>>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
>>>> homegfs [b0-gfsib01a] : No. of entries healed        : 0
>>>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
>>>> homegfs [b0-gfsib01a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b1-gfsib01b] : Starting time of crawl       : Thu Jan 21 12:36:19 2016
>>>> homegfs [b1-gfsib01b] : Ending time of crawl         : Thu Jan 21 12:36:19 2016
>>>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
>>>> homegfs [b1-gfsib01b] : No. of entries healed        : 0
>>>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
>>>> homegfs [b1-gfsib01b] : No. of heal failed entries   : 1
>>>>
>>>> homegfs [b2-gfsib01a] : Starting time of crawl       : Thu Jan 21 12:36:48 2016
>>>> homegfs [b2-gfsib01a] : Ending time of crawl         : Thu Jan 21 12:36:48 2016
>>>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
>>>> homegfs [b2-gfsib01a] : No. of entries healed        : 0
>>>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
>>>> homegfs [b2-gfsib01a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b3-gfsib01b] : Starting time of crawl       : Thu Jan 21 12:36:47 2016
>>>> homegfs [b3-gfsib01b] : Ending time of crawl         : Thu Jan 21 12:36:47 2016
>>>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
>>>> homegfs [b3-gfsib01b] : No. of entries healed        : 0
>>>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
>>>> homegfs [b3-gfsib01b] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b4-gfsib02a] : Starting time of crawl       : Thu Jan 21 12:36:06 2016
>>>> homegfs [b4-gfsib02a] : Ending time of crawl         : Thu Jan 21 12:36:06 2016
>>>> homegfs [b4-gfsib02a] : Type of crawl: INDEX
>>>> homegfs [b4-gfsib02a] : No. of entries healed        : 0
>>>> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
>>>> homegfs [b4-gfsib02a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b5-gfsib02b] : Starting time of crawl       : Thu Jan 21 12:13:40 2016
>>>> homegfs [b5-gfsib02b] : *** Crawl is in progress ***
>>>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
>>>> homegfs [b5-gfsib02b] : No. of entries healed        : 0
>>>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
>>>> homegfs [b5-gfsib02b] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b6-gfsib02a] : Starting time of crawl       : Thu Jan 21 12:36:58 2016
>>>> homegfs [b6-gfsib02a] : Ending time of crawl         : Thu Jan 21 12:36:58 2016
>>>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
>>>> homegfs [b6-gfsib02a] : No. of entries healed        : 0
>>>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
>>>> homegfs [b6-gfsib02a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b7-gfsib02b] : Starting time of crawl       : Thu Jan 21 12:36:50 2016
>>>> homegfs [b7-gfsib02b] : Ending time of crawl         : Thu Jan 21 12:36:50 2016
>>>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
>>>> homegfs [b7-gfsib02b] : No. of entries healed        : 0
>>>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
>>>> homegfs [b7-gfsib02b] : No. of heal failed entries   : 0
>>>>
>>>> ========================================================================================
>>>> I waited a few minutes for the heals to finish
>>>> and ran the heal statistics and info again. One
>>>> file is in split-brain. Aside from the
>>>> split-brain, the load on all systems is down
>>>> now and they are behaving normally.
>>>> glustershd.log is attached. What is going on???
>>>>
>>>> Thu Jan 21 12:53:50 EST 2016
>>>>
>>>> =================== homegfs ===================
>>>>
>>>> homegfs [b0-gfsib01a] : Starting time of crawl       : Thu Jan 21 12:53:02 2016
>>>> homegfs [b0-gfsib01a] : Ending time of crawl         : Thu Jan 21 12:53:02 2016
>>>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
>>>> homegfs [b0-gfsib01a] : No. of entries healed        : 0
>>>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
>>>> homegfs [b0-gfsib01a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b1-gfsib01b] : Starting time of crawl       : Thu Jan 21 12:53:38 2016
>>>> homegfs [b1-gfsib01b] : Ending time of crawl         : Thu Jan 21 12:53:38 2016
>>>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
>>>> homegfs [b1-gfsib01b] : No. of entries healed        : 0
>>>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
>>>> homegfs [b1-gfsib01b] : No. of heal failed entries   : 1
>>>>
>>>> homegfs [b2-gfsib01a] : Starting time of crawl       : Thu Jan 21 12:53:04 2016
>>>> homegfs [b2-gfsib01a] : Ending time of crawl         : Thu Jan 21 12:53:04 2016
>>>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
>>>> homegfs [b2-gfsib01a] : No. of entries healed        : 0
>>>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
>>>> homegfs [b2-gfsib01a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b3-gfsib01b] : Starting time of crawl       : Thu Jan 21 12:53:04 2016
>>>> homegfs [b3-gfsib01b] : Ending time of crawl         : Thu Jan 21 12:53:04 2016
>>>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
>>>> homegfs [b3-gfsib01b] : No. of entries healed        : 0
>>>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
>>>> homegfs [b3-gfsib01b] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b4-gfsib02a] : Starting time of crawl       : Thu Jan 21 12:53:33 2016
>>>> homegfs [b4-gfsib02a] : Ending time of crawl         : Thu Jan 21 12:53:33 2016
>>>> homegfs [b4-gfsib02a] : Type of crawl: INDEX
>>>> homegfs [b4-gfsib02a] : No. of entries healed        : 0
>>>> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
>>>> homegfs [b4-gfsib02a] : No. of heal failed entries   : 1
>>>>
>>>> homegfs [b5-gfsib02b] : Starting time of crawl       : Thu Jan 21 12:53:14 2016
>>>> homegfs [b5-gfsib02b] : Ending time of crawl         : Thu Jan 21 12:53:15 2016
>>>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
>>>> homegfs [b5-gfsib02b] : No. of entries healed        : 0
>>>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
>>>> homegfs [b5-gfsib02b] : No. of heal failed entries   : 3
>>>>
>>>> homegfs [b6-gfsib02a] : Starting time of crawl       : Thu Jan 21 12:53:04 2016
>>>> homegfs [b6-gfsib02a] : Ending time of crawl         : Thu Jan 21 12:53:04 2016
>>>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
>>>> homegfs [b6-gfsib02a] : No. of entries healed        : 0
>>>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
>>>> homegfs [b6-gfsib02a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b7-gfsib02b] : Starting time of crawl       : Thu Jan 21 12:53:09 2016
>>>> homegfs [b7-gfsib02b] : Ending time of crawl         : Thu Jan 21 12:53:09 2016
>>>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
>>>> homegfs [b7-gfsib02b] : No. of entries healed        : 0
>>>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
>>>> homegfs [b7-gfsib02b] : No. of heal failed entries   : 0
>>>>
>>>> *** gluster bug in 'gluster volume heal homegfs statistics' ***
>>>> *** Use 'gluster volume heal homegfs info' until bug is fixed ***
>>>>
>>>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
>>>> /users/bangell/.gconfd - Is in split-brain
>>>>
>>>> Number of entries: 1
>>>>
>>>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
>>>> /users/bangell/.gconfd - Is in split-brain
>>>>
>>>> /users/bangell/.gconfd/saved_state
>>>> Number of entries: 2
>>>>
>>>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
>>>> Number of entries: 0
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jan 21, 2016 at 11:10 AM, Pranith Kumar
>>>> Karampuri <pkarampu at redhat.com
>>>> <mailto:pkarampu at redhat.com>> wrote:
>>>>
>>>>
>>>>
>>>> On 01/21/2016 09:26 PM, Glomski, Patrick wrote:
>>>>> I should mention that the problem is not
>>>>> currently occurring and there are no heals
>>>>> (output appended). By restarting the
>>>>> gluster services, we can stop the crawl,
>>>>> which lowers the load for a while.
>>>>> Subsequent crawls seem to finish properly.
>>>>> For what it's worth, files/folders that
>>>>> show up in the 'volume heal info' output during
>>>>> a hung crawl don't seem to be anything out
>>>>> of the ordinary.
>>>>>
>>>>> Over the past four days, the typical time
>>>>> before the problem recurs after
>>>>> suppressing it in this manner is an hour.
>>>>> Last night when we reached out to you was
>>>>> the last time it happened and the load has
>>>>> been low since (a relief). David believes
>>>>> that recursively listing the files (ls
>>>>> -alR or similar) from a client mount can
>>>>> force the issue to happen, but obviously
>>>>> I'd rather not unless we have some precise
>>>>> thing we're looking for. Let me know if
>>>>> you'd like me to attempt to drive the
>>>>> system unstable like that and what I
>>>>> should look for. As it's a production
>>>>> system, I'd rather not leave it in this
>>>>> state for long.
>>>>
>>>> Would it be possible to send the glustershd
>>>> and mount logs from the past 4 days? I would like
>>>> to see if this is because of directory
>>>> self-heal going wild (Ravi is working on a
>>>> throttling feature for 3.8, which will
>>>> allow putting the brakes on self-heal traffic).
>>>>
>>>> Pranith
>>>>
>>>>>
>>>>> [root at gfs01a xattrop]# gluster volume heal homegfs info
>>>>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 21, 2016 at 10:40 AM, Pranith
>>>>> Kumar Karampuri <pkarampu at redhat.com
>>>>> <mailto:pkarampu at redhat.com>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 01/21/2016 08:25 PM, Glomski,
>>>>> Patrick wrote:
>>>>>> Hello, Pranith. The typical behavior
>>>>>> is that the %cpu on a glusterfsd
>>>>>> process jumps to the number of processor
>>>>>> cores available (800% or 1200%,
>>>>>> depending on the pair of nodes
>>>>>> involved) and the load average on the
>>>>>> machine goes very high (~20). The
>>>>>> volume's heal statistics output shows
>>>>>> that it is crawling one of the bricks
>>>>>> and trying to heal, but this crawl
>>>>>> hangs and never seems to finish.
>>>>>>
>>>>>> The number of files in the xattrop
>>>>>> directory varies over time, so I ran
>>>>>> a wc -l as you requested periodically
>>>>>> for some time and then started
>>>>>> including a datestamped list of the
>>>>>> files in the xattrop directory on each
>>>>>> brick to see which were persistent.
>>>>>> All bricks had files in the xattrop
>>>>>> folder, so all results are attached.
>>>>> Thanks, this info is helpful. I don't
>>>>> see a lot of files. Could you give the
>>>>> output of "gluster volume heal
>>>>> <volname> info"? Is there any
>>>>> directory in there which is LARGE?
>>>>>
>>>>> Pranith
>>>>>
>>>>>>
>>>>>> Please let me know if there is
>>>>>> anything else I can provide.
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 21, 2016 at 12:01 AM,
>>>>>> Pranith Kumar Karampuri
>>>>>> <pkarampu at redhat.com
>>>>>> <mailto:pkarampu at redhat.com>> wrote:
>>>>>>
>>>>>> hey,
>>>>>> Which process is consuming
>>>>>> so much cpu? I went through the
>>>>>> logs you gave me. I see that the
>>>>>> following files are in gfid
>>>>>> mismatch state:
>>>>>>
>>>>>> <066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
>>>>>> <1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
>>>>>> <ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>,
>>>>>>
>>>>>> Could you give me the output of
>>>>>> "ls <brick-path>/.glusterfs/indices/xattrop | wc -l"
>>>>>> on all the bricks
>>>>>> which are acting this way? This
>>>>>> will tell us the number of
>>>>>> pending self-heals on the system.
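>>>>>>
>>>>>> For example, something along these
>>>>>> lines on each server, using the brick
>>>>>> paths from your volume info, gives one
>>>>>> count per brick:
>>>>>>
>>>>>> for b in /data/brick01a/homegfs /data/brick02a/homegfs; do
>>>>>>     echo -n "$b: "; ls "$b/.glusterfs/indices/xattrop" | wc -l
>>>>>> done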
>>>>>>
>>>>>> Pranith
>>>>>>
>>>>>>
>>>>>> On 01/20/2016 09:26 PM, David
>>>>>> Robinson wrote:
>>>>>>> resending with parsed logs...
>>>>>>>>> I am having issues with 3.6.6
>>>>>>>>> where the load will spike up
>>>>>>>>> to 800% for one of the
>>>>>>>>> glusterfsd processes and the
>>>>>>>>> users can no longer access the
>>>>>>>>> system. If I reboot the node,
>>>>>>>>> the heal will finish normally
>>>>>>>>> after a few minutes and the
>>>>>>>>> system will be responsive,
>>>>>>>>> but a few hours later the
>>>>>>>>> issue will start again. It
>>>>>>>>> looks like it is hanging in a
>>>>>>>>> heal and spinning up the load
>>>>>>>>> on one of the bricks. The
>>>>>>>>> heal gets stuck and says it is
>>>>>>>>> crawling and never returns.
>>>>>>>>> After a few minutes of the
>>>>>>>>> heal saying it is crawling,
>>>>>>>>> the load spikes up and the
>>>>>>>>> mounts become unresponsive.
>>>>>>>>> Any suggestions on how to fix
>>>>>>>>> this? It has us stopped cold
>>>>>>>>> as the user can no longer
>>>>>>>>> access the systems when the
>>>>>>>>> load spikes... Logs attached.
>>>>>>>>> System setup info is:
>>>>>>>>> [root at gfs01a ~]# gluster
>>>>>>>>> volume info homegfs
>>>>>>>>>
>>>>>>>>> Volume Name: homegfs
>>>>>>>>> Type: Distributed-Replicate
>>>>>>>>> Volume ID:
>>>>>>>>> 1e32672a-f1b7-4b58-ba94-58c085e59071
>>>>>>>>> Status: Started
>>>>>>>>> Number of Bricks: 4 x 2 = 8
>>>>>>>>> Transport-type: tcp
>>>>>>>>> Bricks:
>>>>>>>>> Brick1:
>>>>>>>>> gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>>>>>>>> Brick2:
>>>>>>>>> gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>>>>>>>> Brick3:
>>>>>>>>> gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>>>>>>>> Brick4:
>>>>>>>>> gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>>>>>>>> Brick5:
>>>>>>>>> gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>>>>>>>> Brick6:
>>>>>>>>> gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>>>>>>>> Brick7:
>>>>>>>>> gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>>>>>>>> Brick8:
>>>>>>>>> gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>>>>>>>> Options Reconfigured:
>>>>>>>>> performance.io-thread-count: 32
>>>>>>>>> performance.cache-size: 128MB
>>>>>>>>> performance.write-behind-window-size:
>>>>>>>>> 128MB
>>>>>>>>> server.allow-insecure: on
>>>>>>>>> network.ping-timeout: 42
>>>>>>>>> storage.owner-gid: 100
>>>>>>>>> geo-replication.indexing: off
>>>>>>>>> geo-replication.ignore-pid-check:
>>>>>>>>> on
>>>>>>>>> changelog.changelog: off
>>>>>>>>> changelog.fsync-interval: 3
>>>>>>>>> changelog.rollover-time: 15
>>>>>>>>> server.manage-gids: on
>>>>>>>>> diagnostics.client-log-level:
>>>>>>>>> WARNING
>>>>>>>>> [root at gfs01a ~]# rpm -qa |
>>>>>>>>> grep gluster
>>>>>>>>> gluster-nagios-common-0.1.1-0.el6.noarch
>>>>>>>>> glusterfs-fuse-3.6.6-1.el6.x86_64
>>>>>>>>> glusterfs-debuginfo-3.6.6-1.el6.x86_64
>>>>>>>>> glusterfs-libs-3.6.6-1.el6.x86_64
>>>>>>>>> glusterfs-geo-replication-3.6.6-1.el6.x86_64
>>>>>>>>> glusterfs-api-3.6.6-1.el6.x86_64
>>>>>>>>> glusterfs-devel-3.6.6-1.el6.x86_64
>>>>>>>>> glusterfs-api-devel-3.6.6-1.el6.x86_64
>>>>>>>>> glusterfs-3.6.6-1.el6.x86_64
>>>>>>>>> glusterfs-cli-3.6.6-1.el6.x86_64
>>>>>>>>> glusterfs-rdma-3.6.6-1.el6.x86_64
>>>>>>>>> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
>>>>>>>>> glusterfs-server-3.6.6-1.el6.x86_64
>>>>>>>>> glusterfs-extra-xlators-3.6.6-1.el6.x86_64
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Gluster-devel mailing list
>>>>>>> Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Gluster-users mailing list
>>>>>> Gluster-users at gluster.org
>>>>>> <mailto:Gluster-users at gluster.org>
>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>