[Gluster-devel] heal hanging

Thu Jan 21 05:01:54 UTC 2016

Hey,
        Which process is consuming so much CPU? I went through the logs 
you gave me. I see that the following files are in gfid mismatch state:

<066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
<1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
<ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>,

Could you give me the output of
"ls <brick-path>/indices/xattrop | wc -l"
on all the bricks which are acting this way? This will tell us the
number of pending self-heals on the system.
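
To collect that count for several bricks in one pass, something like the following sketch might help. The function name count_pending and the INDEX_SUBDIR variable are made up for illustration; the default subdirectory is simply the one from the command quoted above, so adjust it if the index directory sits elsewhere under your bricks (for example under a .glusterfs directory, which is an assumption to verify on your nodes).

```shell
#!/bin/sh
# Sketch only: print the number of entries in each brick's xattrop index
# directory, one "<brick> <count>" line per brick.
# INDEX_SUBDIR is a hypothetical knob; its default mirrors the quoted command.
INDEX_SUBDIR="${INDEX_SUBDIR:-indices/xattrop}"

count_pending() {
    for brick in "$@"; do
        dir="$brick/$INDEX_SUBDIR"
        if [ -d "$dir" ]; then
            # Equivalent to: ls "$dir" | wc -l (padding from wc stripped)
            n=$(ls "$dir" | wc -l | tr -d ' ')
            printf '%s %s\n' "$brick" "$n"
        else
            # Report a missing directory instead of printing a misleading 0
            printf '%s %s\n' "$brick" "missing"
        fi
    done
}
```

On gfs01a, for instance, this could be run as "count_pending /data/brick01a/homegfs /data/brick02a/homegfs" and repeated on the other nodes with their own brick paths. The number is only a rough indicator of pending heals, since it counts every entry in the directory.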

Pranith

On 01/20/2016 09:26 PM, David Robinson wrote:
> resending with parsed logs...
>>> I am having issues with 3.6.6 where the load will spike up to 800% 
>>> for one of the glusterfsd processes and the users can no longer 
>>> access the system.  If I reboot the node, the heal will finish 
>>> normally after a few minutes and the system will be responsive, 
>>> but a few hours later the issue will start again.  It looks like it 
>>> is hanging in a heal and spinning up the load on one of the bricks.  
>>> The heal gets stuck and says it is crawling and never returns.  
>>> After a few minutes of the heal saying it is crawling, the load 
>>> spikes up and the mounts become unresponsive.
>>> Any suggestions on how to fix this?  It has us stopped cold as the 
>>> user can no longer access the systems when the load spikes... Logs 
>>> attached.
>>> System setup info is:
>>> [root@gfs01a ~]# gluster volume info homegfs
>>>
>>> Volume Name: homegfs
>>> Type: Distributed-Replicate
>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>> Status: Started
>>> Number of Bricks: 4 x 2 = 8
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>> Options Reconfigured:
>>> performance.io-thread-count: 32
>>> performance.cache-size: 128MB
>>> performance.write-behind-window-size: 128MB
>>> server.allow-insecure: on
>>> network.ping-timeout: 42
>>> storage.owner-gid: 100
>>> geo-replication.indexing: off
>>> geo-replication.ignore-pid-check: on
>>> changelog.changelog: off
>>> changelog.fsync-interval: 3
>>> changelog.rollover-time: 15
>>> server.manage-gids: on
>>> diagnostics.client-log-level: WARNING
>>> [root@gfs01a ~]# rpm -qa | grep gluster
>>> gluster-nagios-common-0.1.1-0.el6.noarch
>>> glusterfs-fuse-3.6.6-1.el6.x86_64
>>> glusterfs-debuginfo-3.6.6-1.el6.x86_64
>>> glusterfs-libs-3.6.6-1.el6.x86_64
>>> glusterfs-geo-replication-3.6.6-1.el6.x86_64
>>> glusterfs-api-3.6.6-1.el6.x86_64
>>> glusterfs-devel-3.6.6-1.el6.x86_64
>>> glusterfs-api-devel-3.6.6-1.el6.x86_64
>>> glusterfs-3.6.6-1.el6.x86_64
>>> glusterfs-cli-3.6.6-1.el6.x86_64
>>> glusterfs-rdma-3.6.6-1.el6.x86_64
>>> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
>>> glusterfs-server-3.6.6-1.el6.x86_64
>>> glusterfs-extra-xlators-3.6.6-1.el6.x86_64
