[Gluster-devel] [Gluster-users] heal hanging

Glomski, Patrick patrick.glomski at corvidtec.com
Fri Jan 22 02:46:40 UTC 2016


Last entry for get_real_filename on any of the bricks was when we turned
off the samba gfapi vfs plugin earlier today:

/var/log/glusterfs/bricks/data-brick01a-homegfs.log:[2016-01-21
15:13:00.008239] E [server-rpc-fops.c:768:server_getxattr_cbk]
0-homegfs-server: 105: GETXATTR /wks_backup
(40e582d6-b0c7-4099-ba88-9168a3c32ca6)
(glusterfs.get_real_filename:desktop.ini) ==> (Permission denied)

We'll get back to you with those traces when %cpu spikes again. As with
most sporadic problems, as soon as you want something out of it, the issue
becomes harder to reproduce.


On Thu, Jan 21, 2016 at 9:21 PM, Pranith Kumar Karampuri <
pkarampu at redhat.com> wrote:

>
>
> On 01/22/2016 07:25 AM, Glomski, Patrick wrote:
>
> Unfortunately, all samba mounts to the gluster volume through the gfapi
> vfs plugin have been disabled for the last 6 hours or so and frequency of
> %cpu spikes is increased. We had switched to sharing a fuse mount through
> samba, but I just disabled that as well. There are no samba shares of this
> volume now. The spikes now happen every thirty minutes or so. We've
> resorted to just rebooting the machine with high load for the present.
>
>
> Could you see if the logs of following type are not at all coming?
> [2016-01-21 15:13:00.005736] E [server-rpc-fops.c:768:server_getxattr_cbk]
> 0-homegfs-server: 110: GETXATTR /wks_backup (40e582d6-b0c7-4099-ba88-9168a3c
> 32ca6) (glusterfs.get_real_filename:desktop.ini) ==> (Permission denied)
>
> These are operations that failed. Operations that succeed are the ones
> that will scan the directory. But I don't have a way to find them other
> than using tcpdumps.
>
> At the moment I have 2 theories:
> 1) these get_real_filename calls
> 2) [2016-01-21 16:10:38.017828] E [server-helpers.c:46:gid_resolve]
> 0-gid-cache: getpwuid_r(494) failed
> "
>
> Yessir they are.  Normally, sssd would look to the local cache file in
> /var/lib/sss/db/ first, to get any group or userid information, then go out
> to the domain controller.  I put the options that we are using on our GFS
> volumes below…  Thanks for your help.
>
>
>
> We had been running sssd with sssd_nss and sssd_be sub-processes on these
> systems for a long time, under the GFS 3.5.2 code, and not run into the
> problem that David described with the high cpu usage on sssd_nss.
>
> *" *That was Tom Young's email 1.5 years back when we debugged it. But
> the process which was consuming lot of cpu is sssd_nss. So I am not sure if
> it is same issue. Let us debug to see '1)' doesn't happen. The gstack
> traces I asked for should also help.
>
>
> Pranith
>
>
> On Thu, Jan 21, 2016 at 8:49 PM, Pranith Kumar Karampuri <
> pkarampu at redhat.com> wrote:
>
>>
>>
>> On 01/22/2016 07:13 AM, Glomski, Patrick wrote:
>>
>> We use the samba glusterfs virtual filesystem (the current version
>> provided on download.gluster.org), but no windows clients connecting
>> directly.
>>
>>
>> Hmm.. Is there a way to disable using this and check if the CPU% still
>> increases? What getxattr of "glusterfs.get_real_filename <filanme>" does is
>> to scan the entire directory looking for strcasecmp(<filname>,
>> <scanned-filename>). If anything matches then it will return the
>> <scanned-filename>. But the problem is the scan is costly. So I wonder if
>> this is the reason for the CPU spikes.
>>
>> Pranith
>>
>>
>> On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri <
>> pkarampu at redhat.com> wrote:
>>
>>> Do you have any windows clients? I see a lot of getxattr calls for
>>> "glusterfs.get_real_filename" which lead to full readdirs of the
>>> directories on the brick.
>>>
>>> Pranith
>>>
>>> On 01/22/2016 12:51 AM, Glomski, Patrick wrote:
>>>
>>> Pranith, could this kind of behavior be self-inflicted by us deleting
>>> files directly from the bricks? We have done that in the past to clean up
>>> an issues where gluster wouldn't allow us to delete from the mount.
>>>
>>> If so, is it feasible to clean them up by running a search on the
>>> .glusterfs directories directly and removing files with a reference count
>>> of 1 that are non-zero size (or directly checking the xattrs to be sure
>>> that it's not a DHT link).
>>>
>>> find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links -2
>>> -exec rm -f "{}" \;
>>>
>>> Is there anything I'm inherently missing with that approach that will
>>> further corrupt the system?
>>>
>>>
>>> On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick <
>>> patrick.glomski at corvidtec.com> wrote:
>>>
>>>> Load spiked again: ~1200%cpu on gfs02a for glusterfsd. Crawl has been
>>>> running on one of the bricks on gfs02b for 25 min or so and users cannot
>>>> access the volume.
>>>>
>>>> I re-listed the xattrop directories as well as a 'top' entry and heal
>>>> statistics. Then I restarted the gluster services on gfs02a.
>>>>
>>>> =================== top ===================
>>>> PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
>>>> COMMAND
>>>>  8969 root      20   0 2815m 204m 3588 S 1181.0  0.6 591:06.93
>>>> glusterfsd
>>>>
>>>> =================== xattrop ===================
>>>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
>>>> xattrop-41f19453-91e4-437c-afa9-3b25614de210
>>>> xattrop-9b815879-2f4d-402b-867c-a6d65087788c
>>>>
>>>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
>>>> xattrop-70131855-3cfb-49af-abce-9d23f57fb393
>>>> xattrop-dfb77848-a39d-4417-a725-9beca75d78c6
>>>>
>>>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
>>>> e6e47ed9-309b-42a7-8c44-28c29b9a20f8
>>>> xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125
>>>> xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934
>>>> xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0
>>>>
>>>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
>>>> xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc
>>>> xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413
>>>>
>>>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
>>>> xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531
>>>>
>>>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
>>>> xattrop-7e20fdb1-5224-4b9a-be06-568708526d70
>>>>
>>>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
>>>> 8034bc06-92cd-4fa5-8aaf-09039e79d2c8
>>>> c9ce22ed-6d8b-471b-a111-b39e57f0b512
>>>> 94fa1d60-45ad-4341-b69c-315936b51e8d
>>>> xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7
>>>>
>>>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
>>>> xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d
>>>>
>>>>
>>>> =================== heal stats ===================
>>>>
>>>> homegfs [b0-gfsib01a] : Starting time of crawl       : Thu Jan 21
>>>> 12:36:45 2016
>>>> homegfs [b0-gfsib01a] : Ending time of crawl         : Thu Jan 21
>>>> 12:36:45 2016
>>>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
>>>> homegfs [b0-gfsib01a] : No. of entries healed        : 0
>>>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
>>>> homegfs [b0-gfsib01a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b1-gfsib01b] : Starting time of crawl       : Thu Jan 21
>>>> 12:36:19 2016
>>>> homegfs [b1-gfsib01b] : Ending time of crawl         : Thu Jan 21
>>>> 12:36:19 2016
>>>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
>>>> homegfs [b1-gfsib01b] : No. of entries healed        : 0
>>>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
>>>> homegfs [b1-gfsib01b] : No. of heal failed entries   : 1
>>>>
>>>> homegfs [b2-gfsib01a] : Starting time of crawl       : Thu Jan 21
>>>> 12:36:48 2016
>>>> homegfs [b2-gfsib01a] : Ending time of crawl         : Thu Jan 21
>>>> 12:36:48 2016
>>>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
>>>> homegfs [b2-gfsib01a] : No. of entries healed        : 0
>>>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
>>>> homegfs [b2-gfsib01a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b3-gfsib01b] : Starting time of crawl       : Thu Jan 21
>>>> 12:36:47 2016
>>>> homegfs [b3-gfsib01b] : Ending time of crawl         : Thu Jan 21
>>>> 12:36:47 2016
>>>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
>>>> homegfs [b3-gfsib01b] : No. of entries healed        : 0
>>>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
>>>> homegfs [b3-gfsib01b] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b4-gfsib02a] : Starting time of crawl       : Thu Jan 21
>>>> 12:36:06 2016
>>>> homegfs [b4-gfsib02a] : Ending time of crawl         : Thu Jan 21
>>>> 12:36:06 2016
>>>> homegfs [b4-gfsib02a] : Type of crawl: INDEX
>>>> homegfs [b4-gfsib02a] : No. of entries healed        : 0
>>>> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
>>>> homegfs [b4-gfsib02a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b5-gfsib02b] : Starting time of crawl       : Thu Jan 21
>>>> 12:13:40 2016
>>>> homegfs [b5-gfsib02b] :                                *** Crawl is in
>>>> progress ***
>>>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
>>>> homegfs [b5-gfsib02b] : No. of entries healed        : 0
>>>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
>>>> homegfs [b5-gfsib02b] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b6-gfsib02a] : Starting time of crawl       : Thu Jan 21
>>>> 12:36:58 2016
>>>> homegfs [b6-gfsib02a] : Ending time of crawl         : Thu Jan 21
>>>> 12:36:58 2016
>>>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
>>>> homegfs [b6-gfsib02a] : No. of entries healed        : 0
>>>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
>>>> homegfs [b6-gfsib02a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b7-gfsib02b] : Starting time of crawl       : Thu Jan 21
>>>> 12:36:50 2016
>>>> homegfs [b7-gfsib02b] : Ending time of crawl         : Thu Jan 21
>>>> 12:36:50 2016
>>>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
>>>> homegfs [b7-gfsib02b] : No. of entries healed        : 0
>>>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
>>>> homegfs [b7-gfsib02b] : No. of heal failed entries   : 0
>>>>
>>>>
>>>>
>>>> ========================================================================================
>>>> I waited a few minutes for the heals to finish and ran the heal
>>>> statistics and info again. one file is in split-brain. Aside from the
>>>> split-brain, the load on all systems is down now and they are behaving
>>>> normally. glustershd.log is attached. What is going on???
>>>>
>>>> Thu Jan 21 12:53:50 EST 2016
>>>>
>>>> =================== homegfs ===================
>>>>
>>>> homegfs [b0-gfsib01a] : Starting time of crawl       : Thu Jan 21
>>>> 12:53:02 2016
>>>> homegfs [b0-gfsib01a] : Ending time of crawl         : Thu Jan 21
>>>> 12:53:02 2016
>>>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
>>>> homegfs [b0-gfsib01a] : No. of entries healed        : 0
>>>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
>>>> homegfs [b0-gfsib01a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b1-gfsib01b] : Starting time of crawl       : Thu Jan 21
>>>> 12:53:38 2016
>>>> homegfs [b1-gfsib01b] : Ending time of crawl         : Thu Jan 21
>>>> 12:53:38 2016
>>>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
>>>> homegfs [b1-gfsib01b] : No. of entries healed        : 0
>>>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
>>>> homegfs [b1-gfsib01b] : No. of heal failed entries   : 1
>>>>
>>>> homegfs [b2-gfsib01a] : Starting time of crawl       : Thu Jan 21
>>>> 12:53:04 2016
>>>> homegfs [b2-gfsib01a] : Ending time of crawl         : Thu Jan 21
>>>> 12:53:04 2016
>>>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
>>>> homegfs [b2-gfsib01a] : No. of entries healed        : 0
>>>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
>>>> homegfs [b2-gfsib01a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b3-gfsib01b] : Starting time of crawl       : Thu Jan 21
>>>> 12:53:04 2016
>>>> homegfs [b3-gfsib01b] : Ending time of crawl         : Thu Jan 21
>>>> 12:53:04 2016
>>>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
>>>> homegfs [b3-gfsib01b] : No. of entries healed        : 0
>>>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
>>>> homegfs [b3-gfsib01b] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b4-gfsib02a] : Starting time of crawl       : Thu Jan 21
>>>> 12:53:33 2016
>>>> homegfs [b4-gfsib02a] : Ending time of crawl         : Thu Jan 21
>>>> 12:53:33 2016
>>>> homegfs [b4-gfsib02a] : Type of crawl: INDEX
>>>> homegfs [b4-gfsib02a] : No. of entries healed        : 0
>>>> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
>>>> homegfs [b4-gfsib02a] : No. of heal failed entries   : 1
>>>>
>>>> homegfs [b5-gfsib02b] : Starting time of crawl       : Thu Jan 21
>>>> 12:53:14 2016
>>>> homegfs [b5-gfsib02b] : Ending time of crawl         : Thu Jan 21
>>>> 12:53:15 2016
>>>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
>>>> homegfs [b5-gfsib02b] : No. of entries healed        : 0
>>>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
>>>> homegfs [b5-gfsib02b] : No. of heal failed entries   : 3
>>>>
>>>> homegfs [b6-gfsib02a] : Starting time of crawl       : Thu Jan 21
>>>> 12:53:04 2016
>>>> homegfs [b6-gfsib02a] : Ending time of crawl         : Thu Jan 21
>>>> 12:53:04 2016
>>>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
>>>> homegfs [b6-gfsib02a] : No. of entries healed        : 0
>>>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
>>>> homegfs [b6-gfsib02a] : No. of heal failed entries   : 0
>>>>
>>>> homegfs [b7-gfsib02b] : Starting time of crawl       : Thu Jan 21
>>>> 12:53:09 2016
>>>> homegfs [b7-gfsib02b] : Ending time of crawl         : Thu Jan 21
>>>> 12:53:09 2016
>>>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
>>>> homegfs [b7-gfsib02b] : No. of entries healed        : 0
>>>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
>>>> homegfs [b7-gfsib02b] : No. of heal failed entries   : 0
>>>>
>>>> *** gluster bug in 'gluster volume heal homegfs statistics'   ***
>>>> *** Use 'gluster volume heal homegfs info' until bug is fixed ***
>>>>
>>>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
>>>> /users/bangell/.gconfd - Is in split-brain
>>>>
>>>> Number of entries: 1
>>>>
>>>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
>>>> /users/bangell/.gconfd - Is in split-brain
>>>>
>>>> /users/bangell/.gconfd/saved_state
>>>> Number of entries: 2
>>>>
>>>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
>>>> Number of entries: 0
>>>>
>>>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
>>>> Number of entries: 0
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jan 21, 2016 at 11:10 AM, Pranith Kumar Karampuri <
>>>> pkarampu at redhat.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 01/21/2016 09:26 PM, Glomski, Patrick wrote:
>>>>>
>>>>> I should mention that the problem is not currently occurring and there
>>>>> are no heals (output appended). By restarting the gluster services, we can
>>>>> stop the crawl, which lowers the load for a while. Subsequent crawls seem
>>>>> to finish properly. For what it's worth, files/folders that show up in the
>>>>> 'volume info' output during a hung crawl don't seem to be anything out of
>>>>> the ordinary.
>>>>>
>>>>> Over the past four days, the typical time before the problem recurs
>>>>> after suppressing it in this manner is an hour. Last night when we reached
>>>>> out to you was the last time it happened and the load has been low since (a
>>>>> relief).  David believes that recursively listing the files (ls -alR or
>>>>> similar) from a client mount can force the issue to happen, but obviously
>>>>> I'd rather not unless we have some precise thing we're looking for. Let me
>>>>> know if you'd like me to attempt to drive the system unstable like that and
>>>>> what I should look for. As it's a production system, I'd rather not leave
>>>>> it in this state for long.
>>>>>
>>>>>
>>>>> Will it be possible to send glustershd, mount logs of the past 4 days?
>>>>> I would like to see if this is because of directory self-heal going wild
>>>>> (Ravi is working on throttling feature for 3.8, which will allow to put
>>>>> breaks on self-heal traffic)
>>>>>
>>>>> Pranith
>>>>>
>>>>>
>>>>> [root at gfs01a xattrop]# gluster volume heal homegfs info
>>>>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
>>>>> Number of entries: 0
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 21, 2016 at 10:40 AM, Pranith Kumar Karampuri <
>>>>> pkarampu at redhat.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 01/21/2016 08:25 PM, Glomski, Patrick wrote:
>>>>>>
>>>>>> Hello, Pranith. The typical behavior is that the %cpu on a glusterfsd
>>>>>> process jumps to number of processor cores available (800% or 1200%,
>>>>>> depending on the pair of nodes involved) and the load average on the
>>>>>> machine goes very high (~20). The volume's heal statistics output shows
>>>>>> that it is crawling one of the bricks and trying to heal, but this crawl
>>>>>> hangs and never seems to finish.
>>>>>>
>>>>>>
>>>>>> The number of files in the xattrop directory varies over time, so I
>>>>>> ran a wc -l as you requested periodically for some time and then started
>>>>>> including a datestamped list of the files that were in the xattrops
>>>>>> directory on each brick to see which were persistent. All bricks had files
>>>>>> in the xattrop folder, so all results are attached.
>>>>>>
>>>>>> Thanks this info is helpful. I don't see a lot of files. Could you
>>>>>> give output of "gluster volume heal <volname> info"? Is there any directory
>>>>>> in there which is LARGE?
>>>>>>
>>>>>> Pranith
>>>>>>
>>>>>>
>>>>>> Please let me know if there is anything else I can provide.
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar Karampuri <
>>>>>> pkarampu at redhat.com> wrote:
>>>>>>
>>>>>>> hey,
>>>>>>>        Which process is consuming so much cpu? I went through the
>>>>>>> logs you gave me. I see that the following files are in gfid mismatch state:
>>>>>>>
>>>>>>> <066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
>>>>>>> <1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
>>>>>>> <ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>,
>>>>>>>
>>>>>>> Could you give me the output of "ls <brick-path>/indices/xattrop |
>>>>>>> wc -l" output on all the bricks which are acting this way? This will tell
>>>>>>> us the number of pending self-heals on the system.
>>>>>>>
>>>>>>> Pranith
>>>>>>>
>>>>>>>
>>>>>>> On 01/20/2016 09:26 PM, David Robinson wrote:
>>>>>>>
>>>>>>> resending with parsed logs...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I am having issues with 3.6.6 where the load will spike up to 800%
>>>>>>> for one of the glusterfsd processes and the users can no longer access the
>>>>>>> system.  If I reboot the node, the heal will finish normally after a few
>>>>>>> minutes and the system will be responsive, but a few hours later the issue
>>>>>>> will start again.  It look like it is hanging in a heal and spinning up the
>>>>>>> load on one of the bricks.  The heal gets stuck and says it is crawling and
>>>>>>> never returns.  After a few minutes of the heal saying it is crawling, the
>>>>>>> load spikes up and the mounts become unresponsive.
>>>>>>>
>>>>>>> Any suggestions on how to fix this?  It has us stopped cold as the
>>>>>>> user can no longer access the systems when the load spikes... Logs attached.
>>>>>>>
>>>>>>> System setup info is:
>>>>>>>
>>>>>>> [root at gfs01a ~]# gluster volume info homegfs
>>>>>>>
>>>>>>> Volume Name: homegfs
>>>>>>> Type: Distributed-Replicate
>>>>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>>>>>> Status: Started
>>>>>>> Number of Bricks: 4 x 2 = 8
>>>>>>> Transport-type: tcp
>>>>>>> Bricks:
>>>>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>>>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>>>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>>>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>>>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>>>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>>>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>>>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>>>>>> Options Reconfigured:
>>>>>>> performance.io-thread-count: 32
>>>>>>> performance.cache-size: 128MB
>>>>>>> performance.write-behind-window-size: 128MB
>>>>>>> server.allow-insecure: on
>>>>>>> network.ping-timeout: 42
>>>>>>> storage.owner-gid: 100
>>>>>>> geo-replication.indexing: off
>>>>>>> geo-replication.ignore-pid-check: on
>>>>>>> changelog.changelog: off
>>>>>>> changelog.fsync-interval: 3
>>>>>>> changelog.rollover-time: 15
>>>>>>> server.manage-gids: on
>>>>>>> diagnostics.client-log-level: WARNING
>>>>>>>
>>>>>>> [root at gfs01a ~]# rpm -qa | grep gluster
>>>>>>> gluster-nagios-common-0.1.1-0.el6.noarch
>>>>>>> glusterfs-fuse-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-debuginfo-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-libs-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-geo-replication-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-api-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-devel-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-api-devel-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-cli-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-rdma-3.6.6-1.el6.x86_64
>>>>>>> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
>>>>>>> glusterfs-server-3.6.6-1.el6.x86_64
>>>>>>> glusterfs-extra-xlators-3.6.6-1.el6.x86_64
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Gluster-devel mailing listGluster-devel at gluster.orghttp://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Gluster-users mailing list
>>>>>>> Gluster-users at gluster.org
>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-devel/attachments/20160121/8fa6246f/attachment-0001.html>


More information about the Gluster-devel mailing list