[Bugs] [Bug 1722541] New: stale shd process files leading to heal timing out and heal daemon not coming up for all volumes
bugzilla at redhat.com
Thu Jun 20 15:17:17 UTC 2019
https://bugzilla.redhat.com/show_bug.cgi?id=1722541
Bug ID: 1722541
Summary: stale shd process files leading to heal timing out and
heal daemon not coming up for all volumes
Product: GlusterFS
Version: mainline
Status: NEW
Component: replicate
Keywords: Regression
Severity: high
Assignee: bugs at gluster.org
Reporter: rkavunga at redhat.com
CC: amukherj at redhat.com, bugs at gluster.org,
nchilaka at redhat.com, rhs-bugs at redhat.com,
rkavunga at redhat.com, sankarshan at redhat.com,
storage-qa-internal at redhat.com
Depends On: 1721802
Target Milestone: ---
Classification: Community
+++ This bug was initially created as a clone of Bug #1721802 +++
Description of problem:
======================
I have a 3-node cluster with brick multiplexing (brickmux) enabled.
Three volumes exist, as below:
one 12x(6+2) ecvol named cvlt-ecv
two 1x3 afr vols, namely testvol and logvol
I/O is running on the cvlt-ecv volume (just dd writes and appends).
Two of the nodes have been upgraded over the past few days.
As part of upgrading the last node of the 3-node cluster to 6.0.5 (including
the kernel), I rebooted the node.
After the reboot, the bricks did not come up because of some bad entries in
fstab. On resolving them, I also noticed that the cluster had gone into the
rejected state.
When I checked the cksums of the cvlt-ecv volume, I noticed that the cksum
value on n3 (the node being upgraded) differed from the value on n1 and n2.
To fix that, we deleted the entire cvlt-ecv directory under /var/lib/glusterd
so that glusterd would heal it.
We restarted glusterd, and the peer rejected issue was fixed.
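
The cksum comparison described above can be sketched roughly as follows. This is
only an illustration, not part of the original report: it assumes the cksum file
content of each node has already been collected, and that each file holds a line
of the form info=<number>. The helper names and the sample values in the usage
note are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def parse_cksum(text: str) -> str:
    """Extract the value from a line such as 'info=1234567890' (assumed format)."""
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "info":
            return value.strip()
    raise ValueError("no 'info=' entry found")


def nodes_out_of_sync(cksum_by_node: Dict[str, str]) -> List[str]:
    """Return the nodes whose cksum differs from the most common value."""
    values = {node: parse_cksum(text) for node, text in cksum_by_node.items()}
    majority, _count = Counter(values.values()).most_common(1)[0]
    return sorted(node for node, value in values.items() if value != majority)
```

For example, passing {"n1": "info=111", "n2": "info=111", "n3": "info=222"}
(made-up values) would single out n3, matching the situation described here.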
However, we noticed that the shd was not showing as online for the 2 afr volumes.
We tried restarting glusterd (including killing the glusterfsd, shd, and fs procs),
but the shd still did not come up for the 2 afr volumes.
Based on the logs, we noticed that /var/run/gluster/testvol and logvol still
have stale pid entries, which are blocking the shd from starting on these
volumes.
I went ahead and deleted the old stale pid files, and the shd came up on all the
volumes.
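
A check for such stale pid files could look like the sketch below. It is only an
illustration, not the procedure actually used here: it assumes the pid files end
in .pid and contain a single numeric pid, and the function names are
hypothetical. It only reports candidates and deletes nothing, since a recorded
pid may have been reused by an unrelated process.

```python
import os
from pathlib import Path
from typing import List


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this pid currently exists."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_stale_pidfiles(run_dir: str) -> List[str]:
    """List pid files under run_dir whose recorded pid is not running."""
    stale = []
    for pidfile in sorted(Path(run_dir).rglob("*.pid")):
        try:
            pid = int(pidfile.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric: leave for manual review
        if not pid_is_running(pid):
            stale.append(str(pidfile))
    return stale
```

Calling find_stale_pidfiles("/var/run/gluster") on an affected node would list
the files to inspect before removing anything by hand.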
At first I thought it was a one-off. However, I now see the same behavior on
another node too, which is quite concerning, as we see the problems below:
1) the manual index heal command is timing out
2) the heal daemon is not running on the other volumes (as a stale pidfile
exists in /var/run/gluster)
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1721802
[Bug 1721802] stale shd process files leading to heal timing out and heal
daemon not coming up for all volumes
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.