[Gluster-devel] [Gluster-users] heal hanging

Pranith Kumar Karampuri pkarampu at redhat.com
Thu Jan 21 15:40:21 UTC 2016



On 01/21/2016 08:25 PM, Glomski, Patrick wrote:
> Hello, Pranith. The typical behavior is that the %cpu on a glusterfsd 
> process jumps to 100% times the number of processor cores available 
> (800% or 1200%, depending on the pair of nodes involved) and the load 
> average on the machine goes very high (~20). The volume's heal 
> statistics output shows that it is crawling one of the bricks and 
> trying to heal, but this crawl hangs and never seems to finish.
>
> The number of files in the xattrop directory varies over time, so I 
> ran wc -l periodically for some time, as you requested, and then 
> started including a datestamped list of the files that were in the 
> xattrop directory on each brick to see which were persistent. All 
> bricks had files in the xattrop directory, so all results are attached.
Thanks, this info is helpful. I don't see a lot of files. Could you give 
the output of "gluster volume heal <volname> info"? Is there any 
directory in there which is LARGE?
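
For anyone collecting the same data, here is a minimal sketch of counting the entries in a directory and printing a datestamped list of them. It works for the xattrop index directory as well as for checking whether a directory from the heal info output is large. The function names count_entries and snapshot_entries are made up for illustration and are not part of gluster; the directory is passed as an argument, so nothing is assumed about where it lives on a brick.

```shell
#!/bin/sh
# Hypothetical helpers, not part of gluster. Both take a directory path as $1.

# Print the number of entries in the directory (dot files included,
# "." and ".." excluded). File names containing newlines would be
# miscounted; that is assumed not to happen here.
count_entries() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    ls -A "$dir" | wc -l | tr -d ' '
}

# Print one line per entry, prefixed with a UTC timestamp, so that
# repeated runs can be compared to see which entries persist.
snapshot_entries() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    stamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    ls -A "$dir" | while IFS= read -r name; do
        printf '%s %s\n' "$stamp" "$name"
    done
}
```

Run periodically on each brick, for example "snapshot_entries <brick-path>/indices/xattrop >> /tmp/xattrop.log" (the log path is only an example), the output shows which entries stay in the index across runs.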

Pranith
>
> Please let me know if there is anything else I can provide.
>
> Patrick
>
>
> On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar Karampuri 
> <pkarampu at redhat.com> wrote:
>
>     hey,
>            Which process is consuming so much cpu? I went through the
>     logs you gave me. I see that the following files are in gfid
>     mismatch state:
>
>     <066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
>     <1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
>     <ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>,
>
>     Could you give me the output of "ls <brick-path>/indices/xattrop |
>     wc -l" on all the bricks which are acting this way? This
>     will tell us the number of pending self-heals on the system.
>
>     Pranith
>
>
>     On 01/20/2016 09:26 PM, David Robinson wrote:
>>     resending with parsed logs...
>>>>     I am having issues with 3.6.6 where the load will spike up to
>>>>     800% for one of the glusterfsd processes and the users can no
>>>>     longer access the system.  If I reboot the node, the heal will
>>>>     finish normally after a few minutes and the system will be
>>>>     responsive, but a few hours later the issue will start again. 
>>>>     It looks like it is hanging in a heal and spinning up the load
>>>>     on one of the bricks.  The heal gets stuck and says it is
>>>>     crawling and never returns.  After a few minutes of the heal
>>>>     saying it is crawling, the load spikes up and the mounts become
>>>>     unresponsive.
>>>>     Any suggestions on how to fix this?  It has us stopped cold as
>>>>     the user can no longer access the systems when the load
>>>>     spikes... Logs attached.
>>>>     System setup info is:
>>>>     [root@gfs01a ~]# gluster volume info homegfs
>>>>
>>>>     Volume Name: homegfs
>>>>     Type: Distributed-Replicate
>>>>     Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>>>     Status: Started
>>>>     Number of Bricks: 4 x 2 = 8
>>>>     Transport-type: tcp
>>>>     Bricks:
>>>>     Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>>>     Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>>>     Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>>>     Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>>>     Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>>>     Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>>>     Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>>>     Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>>>     Options Reconfigured:
>>>>     performance.io-thread-count: 32
>>>>     performance.cache-size: 128MB
>>>>     performance.write-behind-window-size: 128MB
>>>>     server.allow-insecure: on
>>>>     network.ping-timeout: 42
>>>>     storage.owner-gid: 100
>>>>     geo-replication.indexing: off
>>>>     geo-replication.ignore-pid-check: on
>>>>     changelog.changelog: off
>>>>     changelog.fsync-interval: 3
>>>>     changelog.rollover-time: 15
>>>>     server.manage-gids: on
>>>>     diagnostics.client-log-level: WARNING
>>>>     [root@gfs01a ~]# rpm -qa | grep gluster
>>>>     gluster-nagios-common-0.1.1-0.el6.noarch
>>>>     glusterfs-fuse-3.6.6-1.el6.x86_64
>>>>     glusterfs-debuginfo-3.6.6-1.el6.x86_64
>>>>     glusterfs-libs-3.6.6-1.el6.x86_64
>>>>     glusterfs-geo-replication-3.6.6-1.el6.x86_64
>>>>     glusterfs-api-3.6.6-1.el6.x86_64
>>>>     glusterfs-devel-3.6.6-1.el6.x86_64
>>>>     glusterfs-api-devel-3.6.6-1.el6.x86_64
>>>>     glusterfs-3.6.6-1.el6.x86_64
>>>>     glusterfs-cli-3.6.6-1.el6.x86_64
>>>>     glusterfs-rdma-3.6.6-1.el6.x86_64
>>>>     samba-vfs-glusterfs-4.1.11-2.el6.x86_64
>>>>     glusterfs-server-3.6.6-1.el6.x86_64
>>>>     glusterfs-extra-xlators-3.6.6-1.el6.x86_64
