[Gluster-users] GlusterFS cluster stalls if one server from the cluster goes down and then comes back up

Thu Mar 24 13:11:39 UTC 2016

Apologies for the format below. I've been getting emails in digest format,
so I had to copy and paste the email below:

We have been observing the same issue with glusterfs-3.7.8-1.el7.x86_64, but
our investigation shows the following:

- Even after SHD (the self-heal daemon) is turned off, the auto-heal process
still causes the same amount of CPU activity.
- This happens with NFS mounts as well; we've tried both FUSE and NFS v3.
- It gets triggered when adding new nodes/bricks to replicated (or
replicated+distributed) volumes.
- It is triggered especially when auto-heal (or SHD, for that matter)
descends into a directory containing 300K+ files.
- We've used strace/sdig and other methods; the culprit seems to be a
hard-coded number of SHD or auto-heal threads. We've observed that SHD uses
4 threads per brick, and we couldn't find any configuration option to reduce
the thread count.
- We've tried CPU pinning and other measures. This didn't solve the root
cause, but it helped reduce CPU load on our servers. Without CPU pinning,
load goes as high as 100 on 16-core/32-CPU (with HT) servers.

Please let us know if this seems like a bug that needs to be filed.

Thank you

Hello,

On 03/23/2016 06:35 PM, Ravishankar N wrote:
> On 03/23/2016 09:53 PM, Marian Marinov wrote:
>>> What version of gluster is this?
>> 3.7.6
>>> Do you observe the problem even when only the 4th 'non data' server
>>> comes up? In that case it is unlikely that self-heal is the issue.
>> No
>>> Are the clients using FUSE or NFS mounts?
>> FUSE
> Okay, when you say the cluster stalls, I'm assuming the apps using
> files via the fuse mount are stalled. Does the mount log contain
> messages about completing selfheals on files when the mount eventually
> becomes responsive? If yes, you could try setting
> 'cluster.data-self-heal' to off.

Yes we have many lines with similar entries in the logs:

[2016-03-22 11:10:23.398668] I [MSGID: 108026]
[afr-self-heal-common.c:651:afr_log_selfheal] 0-share-replicate-0:
Completed data selfheal on b18c2b05-7186-4c22-ab34-24858b1153e5.
source=0 sinks=2

[2016-03-23 13:11:54.110773] I [MSGID: 108026]
[afr-self-heal-common.c:651:afr_log_selfheal] 0-share-replicate-0:
Completed metadata selfheal on 591d2bee-b55c-4dd6-a1bc-8b7fc5571caa.
source=0 sinks=
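
Entries like the two above can be tallied by heal type with a short script. This is only a minimal sketch, assuming the mount log uses exactly the format quoted above (each entry may be wrapped across several lines). The name `count_selfheals` is made up for illustration and is not part of GlusterFS.

```python
import re
from collections import Counter

# Matches the afr_log_selfheal entries quoted above. Whitespace is
# normalised first because the entries are wrapped across lines.
HEAL_RE = re.compile(
    r"Completed (?P<kind>\w+) selfheal on (?P<gfid>[0-9a-f-]{36})\.\s*"
    r"source=(?P<source>\d*) sinks=(?P<sinks>[\d ]*)"
)

def count_selfheals(log_text):
    """Return a Counter of completed self-heals keyed by heal type."""
    flat = " ".join(log_text.split())
    return Counter(m.group("kind") for m in HEAL_RE.finditer(flat))

# The two entries from the log excerpt above.
sample = """
[2016-03-22 11:10:23.398668] I [MSGID: 108026]
[afr-self-heal-common.c:651:afr_log_selfheal] 0-share-replicate-0:
Completed data selfheal on b18c2b05-7186-4c22-ab34-24858b1153e5.
source=0 sinks=2

[2016-03-23 13:11:54.110773] I [MSGID: 108026]
[afr-self-heal-common.c:651:afr_log_selfheal] 0-share-replicate-0:
Completed metadata selfheal on 591d2bee-b55c-4dd6-a1bc-8b7fc5571caa.
source=0 sinks=
"""

print(dict(count_selfheals(sample)))
```

Comparing the timestamps of these entries with the time the mount became responsive again would show whether the stall coincides with the heals.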

We already tested setting cluster.self-heal-daemon off and we did not
experience the issue in this case. We stopped one node, disabled
self-heal-daemon, started the node and later enabled the
self-heal-daemon. There was no "stalling" in this case.
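
The sequence described above can be sketched as the commands below. This is an illustration only: the volume name `share` is an assumption inferred from the `0-share-replicate-0` prefix in the log, the helper names are made up, and the script only builds and prints the commands rather than running them.

```python
import shlex

VOLUME = "share"  # assumption: inferred from the "0-share-replicate-0" log prefix

def volume_set(volume, key, value):
    """Build the argv for `gluster volume set <VOLNAME> <KEY> <VALUE>`."""
    return ["gluster", "volume", "set", volume, key, value]

def shd_toggle_steps(volume):
    """Commands issued around the node restart, in order.

    Stopping and starting the node itself happens around these
    commands and is not shown here.
    """
    return [
        # Issued while the node is down, before it is started again.
        volume_set(volume, "cluster.self-heal-daemon", "off"),
        # Issued later, once the node has rejoined the cluster.
        volume_set(volume, "cluster.self-heal-daemon", "on"),
    ]

for argv in shd_toggle_steps(VOLUME):
    print(shlex.join(argv))
```

The setting suggested in the quoted message would take the same form: `volume_set(VOLUME, "cluster.data-self-heal", "off")`.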

We will try the suggested setting too.

-- 
Dimitar Ianakiev
System Administrator
www.siteground.com

-- 
Kayra Otaner
BilgiO A.Ş. -  SecOps Experts
PGP KeyID : A945251E | Manager, Enterprise Linux Solutions
www.bilgio.com |  TR +90 (532) 111-7240 x 1001 | US +1 (201) 206-2592