[Bugs] [Bug 1732668] New: stale shd process files leading to heal timing out and heal daemon not coming up for all volumes
bugzilla at redhat.com
Wed Jul 24 05:52:51 UTC 2019
https://bugzilla.redhat.com/show_bug.cgi?id=1732668
Bug ID: 1732668
Summary: stale shd process files leading to heal timing out and
heal daemon not coming up for all volumes
Product: GlusterFS
Version: 7
Status: NEW
Component: replicate
Keywords: Regression, Reopened
Severity: high
Priority: high
Assignee: bugs at gluster.org
Reporter: rkavunga at redhat.com
CC: amukherj at redhat.com, atumball at redhat.com,
bugs at gluster.org, nchilaka at redhat.com,
rhs-bugs at redhat.com, rkavunga at redhat.com,
sankarshan at redhat.com, storage-qa-internal at redhat.com
Depends On: 1721802, 1722541
Target Milestone: ---
Classification: Community
+++ This bug was initially created as a clone of Bug #1722541 +++
+++ This bug was initially created as a clone of Bug #1721802 +++
Description of problem:
======================
I have a 3-node cluster with brick multiplexing enabled.
3 volumes exist, as below:
a 12x(6+2) EC volume named cvlt-ecv
two 1x3 AFR volumes, namely testvol and logvol
I/O is being done on the cvlt-ecv volume (just dd writes and appends).
Two of the nodes had been upgraded over the past few days.
As part of upgrading the last node of the 3-node cluster to 6.0.5 (including
the kernel), I rebooted the node.
After that the bricks were not coming up due to some bad entries in fstab, and
on resolving them I also noticed that the cluster went into peer-rejected
state.
When checking the cksums of the cvlt-ecv volume, I noticed a difference in the
cksum value between n3 (the node being upgraded) and n1 and n2.
Hence, to fix that, we deleted the cvlt-ecv directory under /var/lib/glusterd
so that glusterd would heal it.
After a restart of glusterd, the peer-rejected issue was fixed.
However, we noticed that shd was not showing online for the 2 AFR volumes.
We tried restarting glusterd (including killing the glusterfsd, shd, and fs
processes), but shd still did not come up for the 2 AFR volumes.
Based on the logs, we noticed that /var/run/gluster/testvol and logvol still
had stale pid entries, which blocked shd from starting on these volumes.
I went ahead and deleted the old stale pid files, and shd came up on all the
volumes.
While I thought it was a one-off, I now see the same behavior on another node
too, which is quite concerning, as we see the problems below:
1) the manual index heal command is timing out
2) the heal daemon is not running on the other volumes (stale pidfiles still
exist in /var/run/gluster)
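The manual recovery step described above (checking whether a recorded pid is
still alive and removing the pidfile if not) can be sketched in shell. This is
an illustrative helper only, not glusterd's actual code; the directory layout
and pidfile naming are assumptions:

```shell
#!/bin/sh
# Remove pidfiles in a run directory whose recorded pid is no longer alive.
# Hypothetical helper; glusterd's real pidfile layout may differ.
clean_stale_pidfiles() {
    rundir="$1"
    for pf in "$rundir"/*.pid; do
        [ -e "$pf" ] || continue
        pid=$(cat "$pf")
        # kill -0 tests for process existence without delivering a signal
        if ! kill -0 "$pid" 2>/dev/null; then
            echo "removing stale pidfile: $pf (pid $pid not running)"
            rm -f "$pf"
        fi
    done
}
```

After clearing stale pidfiles this way, a glusterd restart let shd start for
all volumes, as described above.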
--- Additional comment from Worker Ant on 2019-06-20 15:19:42 UTC ---
REVIEW: https://review.gluster.org/22909 (shd/mux: Fix race between mux_proc
unlink and stop) posted (#2) for review on master by mohammed rafi kc
--- Additional comment from Worker Ant on 2019-06-24 05:02:21 UTC ---
REVIEW: https://review.gluster.org/22909 (shd/mux: Fix race between mux_proc
unlink and stop) merged (#4) on master by Atin Mukherjee
--- Additional comment from Worker Ant on 2019-06-24 15:19:45 UTC ---
REVIEW: https://review.gluster.org/22935 (glusterd/svc: Fix race between shd
start and volume stop) posted (#1) for review on master by mohammed rafi kc
--- Additional comment from Worker Ant on 2019-07-09 12:19:37 UTC ---
REVIEW: https://review.gluster.org/22935 (glusterd/svc: update pid of mux
volumes from the shd process) merged (#17) on master by Atin Mukherjee
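Both merged patches address races around the shared shd pidfile: a volume stop
could unlink the pidfile while a concurrent shd start was writing it, leaving
stale or inconsistent state behind. The usual way to close that class of race
is to serialize pidfile updates under a lock. Below is a minimal sketch of
that pattern using flock(1); the paths and function names are hypothetical and
are not glusterd's actual implementation:

```shell
#!/bin/sh
# Hypothetical demo paths; glusterd keeps its real pidfiles under
# /var/run/gluster.
PIDFILE=/tmp/demo-shd.pid
LOCKFILE=/tmp/demo-shd.lock

start_shd() {
    # Write the pid while holding the lock, so a concurrent stop cannot
    # unlink the file mid-write.
    flock "$LOCKFILE" -c "echo $$ > $PIDFILE"
}

stop_shd() {
    # Unlink under the same lock; start and stop can no longer interleave.
    flock "$LOCKFILE" -c "rm -f $PIDFILE"
}
```

With both paths taking the same lock, a stop either completes before the start
writes the pidfile or waits until the write has finished, so no stale file can
be observed half-way.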
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1721802
[Bug 1721802] stale shd process files leading to heal timing out and heal
daemon not coming up for all volumes
https://bugzilla.redhat.com/show_bug.cgi?id=1722541
[Bug 1722541] stale shd process files leading to heal timing out and heal
daemon not coming up for all volumes