[Bugs] [Bug 1732668] New: stale shd process files leading to heal timing out and heal daemon not coming up for all volumes
bugzilla at redhat.com
Wed Jul 24 05:52:51 UTC 2019
https://bugzilla.redhat.com/show_bug.cgi?id=1732668
Bug ID: 1732668
Summary: stale shd process files leading to heal timing out and
heal daemon not coming up for all volumes
Product: GlusterFS
Version: 7
Status: NEW
Component: replicate
Keywords: Regression, Reopened
Severity: high
Priority: high
Assignee: bugs at gluster.org
Reporter: rkavunga at redhat.com
CC: amukherj at redhat.com, atumball at redhat.com,
bugs at gluster.org, nchilaka at redhat.com,
rhs-bugs at redhat.com, rkavunga at redhat.com,
sankarshan at redhat.com, storage-qa-internal at redhat.com
Depends On: 1721802, 1722541
Target Milestone: ---
Classification: Community
+++ This bug was initially created as a clone of Bug #1722541 +++
+++ This bug was initially created as a clone of Bug #1721802 +++
Description of problem:
======================
I have a 3-node cluster with brick multiplexing enabled.
3 volumes exist, as below:
a 12x(6+2) EC volume named cvlt-ecv
two 1x3 AFR volumes, namely testvol and logvol
I/O is being done on the cvlt-ecv volume (just dd writes and appends).
Two of the nodes had been upgraded over the past few days.
As part of upgrading the last node of the 3-node cluster to 6.0.5 (including
the kernel), I rebooted the node.
After that the bricks were not coming up due to some bad entries in fstab, and
on resolving them I also noticed that the cluster went into peer-rejected
state.
When checking the cksums of the cvlt-ecv volume, I noticed a difference in the
cksum value between n3 (the node being upgraded) and n1 and n2.
Hence, to fix that, we deleted the cvlt-ecv directory under /var/lib/glusterd
so that glusterd would heal it.
After a restart of glusterd, the peer-rejected issue was fixed.
However, we noticed that shd was not showing online for the 2 AFR volumes.
We tried restarting glusterd (including killing the glusterfsd, shd, and fs
processes), but shd still did not come up for the 2 AFR volumes.
Based on the logs, we noticed that /var/run/gluster/testvol and logvol still
had stale pid entries, which blocked shd from starting on these volumes.
I went ahead and deleted the old stale pid files, and shd came up on all the
volumes.
While I thought it was a one-off, I now see the same behavior on another node
too, which is quite concerning, as we see the problems below:
1) the manual index heal command is timing out
2) the heal daemon is not running on the other volumes (stale pidfiles still
exist in /var/run/gluster)
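The manual recovery step described above (checking whether a recorded pid is
still alive and removing the pidfile if not) can be sketched in shell. This is
an illustrative helper only, not glusterd's actual code; the directory layout
and pidfile naming are assumptions:

```shell
#!/bin/sh
# Remove pidfiles in a run directory whose recorded pid is no longer alive.
# Hypothetical helper; glusterd's real pidfile layout may differ.
clean_stale_pidfiles() {
    rundir="$1"
    for pf in "$rundir"/*.pid; do
        [ -e "$pf" ] || continue
        pid=$(cat "$pf")
        # kill -0 tests for process existence without delivering a signal
        if ! kill -0 "$pid" 2>/dev/null; then
            echo "removing stale pidfile: $pf (pid $pid not running)"
            rm -f "$pf"
        fi
    done
}
```

After clearing stale pidfiles this way, a glusterd restart let shd start for
all volumes, as described above.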
--- Additional comment from Worker Ant on 2019-06-20 15:19:42 UTC ---
REVIEW: https://review.gluster.org/22909 (shd/mux: Fix race between mux_proc
unlink and stop) posted (#2) for review on master by mohammed rafi kc
--- Additional comment from Worker Ant on 2019-06-24 05:02:21 UTC ---
REVIEW: https://review.gluster.org/22909 (shd/mux: Fix race between mux_proc
unlink and stop) merged (#4) on master by Atin Mukherjee
--- Additional comment from Worker Ant on 2019-06-24 15:19:45 UTC ---
REVIEW: https://review.gluster.org/22935 (glusterd/svc: Fix race between shd
start and volume stop) posted (#1) for review on master by mohammed rafi kc
--- Additional comment from Worker Ant on 2019-07-09 12:19:37 UTC ---
REVIEW: https://review.gluster.org/22935 (glusterd/svc: update pid of mux
volumes from the shd process) merged (#17) on master by Atin Mukherjee
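Both merged patches address races around the shared shd pidfile: a volume stop
could unlink the pidfile while a concurrent shd start was writing it, leaving
stale or inconsistent state behind. The usual way to close that class of race
is to serialize pidfile updates under a lock. Below is a minimal sketch of
that pattern using flock(1); the paths and function names are hypothetical and
are not glusterd's actual implementation:

```shell
#!/bin/sh
# Hypothetical demo paths; glusterd keeps its real pidfiles under
# /var/run/gluster.
PIDFILE=/tmp/demo-shd.pid
LOCKFILE=/tmp/demo-shd.lock

start_shd() {
    # Write the pid while holding the lock, so a concurrent stop cannot
    # unlink the file mid-write.
    flock "$LOCKFILE" -c "echo $$ > $PIDFILE"
}

stop_shd() {
    # Unlink under the same lock; start and stop can no longer interleave.
    flock "$LOCKFILE" -c "rm -f $PIDFILE"
}
```

With both paths taking the same lock, a stop either completes before the start
writes the pidfile or waits until the write has finished, so no stale file can
be observed half-way.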
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1721802
[Bug 1721802] stale shd process files leading to heal timing out and heal
daemon not coming up for all volumes
https://bugzilla.redhat.com/show_bug.cgi?id=1722541
[Bug 1722541] stale shd process files leading to heal timing out and heal
daemon not coming up for all volumes