[Gluster-devel] [Gluster-users] heal hanging

Thu Jan 21 16:10:25 UTC 2016

On 01/21/2016 09:26 PM, Glomski, Patrick wrote:
> I should mention that the problem is not currently occurring and there 
> are no heals (output appended). By restarting the gluster services, we 
> can stop the crawl, which lowers the load for a while. Subsequent 
> crawls seem to finish properly. For what it's worth, files/folders 
> that show up in the 'volume info' output during a hung crawl don't 
> seem to be anything out of the ordinary.
>
> Over the past four days, the typical time before the problem recurs 
> after suppressing it in this manner is an hour. Last night when we 
> reached out to you was the last time it happened and the load has been 
> low since (a relief).  David believes that recursively listing the 
> files (ls -alR or similar) from a client mount can force the issue to 
> happen, but obviously I'd rather not unless we have some precise thing 
> we're looking for. Let me know if you'd like me to attempt to drive 
> the system unstable like that and what I should look for. As it's a 
> production system, I'd rather not leave it in this state for long.

Would it be possible to send the glustershd and mount logs from the past 4 
days? I would like to see whether this is caused by directory self-heal going 
wild (Ravi is working on a throttling feature for 3.8, which will allow us to 
put the brakes on self-heal traffic).
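
For gathering the logs requested here, a minimal sketch follows. It assumes the logs live under /var/log/glusterfs (the usual default, but check your install) and that both the glustershd log and the client mount logs match '*.log*'. The function name recent_gluster_logs is hypothetical and not part of any Gluster tooling.

```shell
# List log files under a directory that were modified in the last N days.
# Both arguments are illustrative defaults; adjust them to your layout.
recent_gluster_logs() {
    log_dir=${1:-/var/log/glusterfs}   # assumed default log location
    days=${2:-4}                       # the thread asks for the past 4 days
    # -mtime -N matches files modified less than N*24 hours ago
    find "$log_dir" -type f -name '*.log*' -mtime "-$days" | sort
}
```

With GNU tar, the resulting list could be bundled with something like: recent_gluster_logs /var/log/glusterfs 4 | tar -czf gluster-logs.tar.gz -T -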

Pranith
>
> [root@gfs01a xattrop]# gluster volume heal homegfs info
> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
> Number of entries: 0
>
> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
> Number of entries: 0
>
> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
> Number of entries: 0
>
> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
> Number of entries: 0
>
> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
> Number of entries: 0
>
> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
> Number of entries: 0
>
> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
> Number of entries: 0
>
> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
> Number of entries: 0
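
With eight bricks, the heal info output above is easier to scan as a per-brick summary with a total. The sketch below assumes the output has the shape shown in this message ("Brick ..." lines followed by "Number of entries: N") and reads the raw CLI output, without the "> " quote prefix, on stdin. The function name summarize_heal_info is hypothetical.

```shell
# Summarize `gluster volume heal <volname> info` output read from stdin:
# print "<brick> <entries>" per brick, then a TOTAL line.
summarize_heal_info() {
    awk '
        /^Brick /              { brick = $2 }
        /^Number of entries: / { n = $NF + 0   # non-numeric values count as 0
                                 total += n
                                 printf "%s %d\n", brick, n }
        END                    { printf "TOTAL %d\n", total + 0 }
    '
}
```

A possible invocation is: gluster volume heal homegfs info | summarize_heal_info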
>
>
>
>
> On Thu, Jan 21, 2016 at 10:40 AM, Pranith Kumar Karampuri 
> <pkarampu at redhat.com> wrote:
>
>
>
>     On 01/21/2016 08:25 PM, Glomski, Patrick wrote:
>>     Hello, Pranith. The typical behavior is that the %cpu on a
>>     glusterfsd process jumps to 100% per available processor core
>>     (800% or 1200%, depending on the pair of nodes involved) and the
>>     load average on the machine goes very high (~20). The volume's
>>     heal statistics output shows that it is crawling one of the
>>     bricks and trying to heal, but this crawl hangs and never seems
>>     to finish.
>>
>>     The number of files in the xattrop directory varies over time, so
>>     I periodically ran the wc -l you requested for some time and
>>     then started including a datestamped list of the files that were
>>     in the xattrop directory on each brick to see which were
>>     persistent. All bricks had files in the xattrop folder, so all
>>     results are attached.
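
The datestamped listings described above, and the check for which entries persist, could be scripted roughly as follows. This is a sketch under assumptions: the index directory path is supplied by the caller, entry names contain no whitespace, and snapshots are taken at least a second apart. The function names snapshot_index and persistent_entries are hypothetical.

```shell
# Write a sorted, datestamped listing of one index directory into out_dir.
snapshot_index() {
    index_dir=$1
    out_dir=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    ls -1 "$index_dir" | sort > "$out_dir/xattrop-$stamp.list"
}

# Print the entries that appear in every snapshot file given as arguments.
# Each name occurs at most once per file, so a count equal to the number
# of files means the entry was present in all of them.
persistent_entries() {
    count=$#
    sort "$@" | uniq -c | awk -v n="$count" '$1 == n { print $2 }'
}
```

Running snapshot_index from cron or a sleep loop and then calling persistent_entries on the collected .list files would give the set of entries that never left the directory during the observation window.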
>     Thanks, this info is helpful. I don't see a lot of files. Could you
>     give the output of "gluster volume heal <volname> info"? Is there
>     any directory in there which is LARGE?
>
>     Pranith
>
>>
>>     Please let me know if there is anything else I can provide.
>>
>>     Patrick
>>
>>
>>     On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar Karampuri
>>     <pkarampu at redhat.com> wrote:
>>
>>         Hey,
>>         Which process is consuming so much CPU? I went through
>>         the logs you gave me. I see that the following files are in
>>         a gfid mismatch state:
>>
>>         <066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
>>         <1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
>>         <ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>,
>>
>>         Could you give me the output of "ls
>>         <brick-path>/indices/xattrop | wc -l" on all the
>>         bricks which are acting this way? This will tell us the
>>         number of pending self-heals on the system.
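
To run the count above across several bricks in one go, something like the following sketch may help. It mirrors the quoted "ls ... | wc -l" and takes each brick's xattrop index directory as an argument, so the exact on-disk location is left to the caller. The function name count_index_entries is hypothetical.

```shell
# Print "<count> <dir>" for each index directory passed as an argument,
# mirroring `ls <dir> | wc -l`. Missing directories are reported on stderr.
count_index_entries() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            # tr strips the padding some wc implementations add
            n=$(ls -1 "$dir" | wc -l | tr -d ' ')
            printf '%s %s\n' "$n" "$dir"
        else
            printf 'missing %s\n' "$dir" >&2
        fi
    done
}
```

On each server, pass the xattrop directory of every local brick; comparing the counts between the two bricks of a replica pair shows which side is holding the pending entries.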
>>
>>         Pranith
>>
>>
>>         On 01/20/2016 09:26 PM, David Robinson wrote:
>>>         resending with parsed logs...
>>>>>         I am having issues with 3.6.6 where the load will spike up
>>>>>         to 800% for one of the glusterfsd processes and the users
>>>>>         can no longer access the system.  If I reboot the node,
>>>>>         the heal will finish normally after a few minutes and the
>>>>>         system will be responsive, but a few hours later the issue
>>>>>         will start again.  It looks like it is hanging in a heal
>>>>>         and spinning up the load on one of the bricks.  The heal
>>>>>         gets stuck and says it is crawling and never returns.
>>>>>         After a few minutes of the heal saying it is crawling, the
>>>>>         load spikes up and the mounts become unresponsive.
>>>>>         Any suggestions on how to fix this?  It has us stopped
>>>>>         cold as the users can no longer access the systems when the
>>>>>         load spikes... Logs attached.
>>>>>         System setup info is:
>>>>>         [root@gfs01a ~]# gluster volume info homegfs
>>>>>
>>>>>         Volume Name: homegfs
>>>>>         Type: Distributed-Replicate
>>>>>         Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>>>>         Status: Started
>>>>>         Number of Bricks: 4 x 2 = 8
>>>>>         Transport-type: tcp
>>>>>         Bricks:
>>>>>         Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>>>>         Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>>>>         Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>>>>         Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>>>>         Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>>>>         Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>>>>         Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>>>>         Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>>>>         Options Reconfigured:
>>>>>         performance.io-thread-count: 32
>>>>>         performance.cache-size: 128MB
>>>>>         performance.write-behind-window-size: 128MB
>>>>>         server.allow-insecure: on
>>>>>         network.ping-timeout: 42
>>>>>         storage.owner-gid: 100
>>>>>         geo-replication.indexing: off
>>>>>         geo-replication.ignore-pid-check: on
>>>>>         changelog.changelog: off
>>>>>         changelog.fsync-interval: 3
>>>>>         changelog.rollover-time: 15
>>>>>         server.manage-gids: on
>>>>>         diagnostics.client-log-level: WARNING
>>>>>         [root@gfs01a ~]# rpm -qa | grep gluster
>>>>>         gluster-nagios-common-0.1.1-0.el6.noarch
>>>>>         glusterfs-fuse-3.6.6-1.el6.x86_64
>>>>>         glusterfs-debuginfo-3.6.6-1.el6.x86_64
>>>>>         glusterfs-libs-3.6.6-1.el6.x86_64
>>>>>         glusterfs-geo-replication-3.6.6-1.el6.x86_64
>>>>>         glusterfs-api-3.6.6-1.el6.x86_64
>>>>>         glusterfs-devel-3.6.6-1.el6.x86_64
>>>>>         glusterfs-api-devel-3.6.6-1.el6.x86_64
>>>>>         glusterfs-3.6.6-1.el6.x86_64
>>>>>         glusterfs-cli-3.6.6-1.el6.x86_64
>>>>>         glusterfs-rdma-3.6.6-1.el6.x86_64
>>>>>         samba-vfs-glusterfs-4.1.11-2.el6.x86_64
>>>>>         glusterfs-server-3.6.6-1.el6.x86_64
>>>>>         glusterfs-extra-xlators-3.6.6-1.el6.x86_64
>>>
>>>
>>>         _______________________________________________
>>>         Gluster-devel mailing list
>>>         Gluster-devel at gluster.org
>>>         http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>>
>
>
