[Bugs] [Bug 1491060] New: PID File handling: self-heal-deamon pid file leaves stale pid and indiscriminately kills pid when glusterd is started

Tue Sep 12 23:32:01 UTC 2017

https://bugzilla.redhat.com/show_bug.cgi?id=1491060

            Bug ID: 1491060
           Summary: PID File handling: self-heal-deamon pid file leaves
                    stale pid and indiscriminately kills pid when glusterd
                    is started
           Product: GlusterFS
           Version: 3.10
         Component: glusterd
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: ben at apcera.com
                CC: bugs at gluster.org

Description of problem:

The self-heal daemon (shd) pid file is left behind with a stale pid, and
glusterd indiscriminately kills that pid when it is started. Pid files are
stored under `/var/lib/glusterd`, which persists across reboots. When glusterd
is started (or restarted, or the host is rebooted), whatever process currently
has the pid recorded in the shd pid file is killed.

Version-Release number of selected component (if applicable):

3.10.4 from ppa:gluster/glusterfs-3.10

How reproducible:

Every time (1 to 1)

Steps to Reproduce:
1. Create a volume. 
2. Enable the self-heal daemon
3. Check the pid files
find /var/lib/glusterd/ -name '*pid' -exec tail -v {} \;
==> /var/lib/glusterd/glustershd/run/glustershd.pid <==
11642
==> /var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid <==
11169
4. killall -w glusterfs
5. create a process, background it, record the pid
sleep infinity & pid=$!
[1] 11669
6. put the pid of the process into the pid file
echo $pid > /var/lib/glusterd/glustershd/run/glustershd.pid
7. confirm above
find /var/lib/glusterd/ -name '*pid' -exec tail -v {} \;
==> /var/lib/glusterd/glustershd/run/glustershd.pid <==
11669
==> /var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid <==
11169
8. restart glusterfs-server
service glusterfs-server restart
glusterfs-server stop/waiting
glusterfs-server start/running, process 11687
9. shell notifies that the background process was terminated
[1]+  Terminated              sleep infinity
10. shd starts, but a process other than glusterfs has been killed
gluster v status
Status of volume: vol0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 172.28.128.5:/data/brick0             49152     0          Y       11169
Brick 172.28.128.6:/data/brick0             49152     0          Y       11023
Self-heal Daemon on localhost               N/A       N/A        Y       12023
Self-heal Daemon on 172.28.128.6            N/A       N/A        Y       11044

Task Status of Volume vol0
------------------------------------------------------------------------------
There are no active volume tasks

Note: In some cases shd fails to start. 

Note2: In one case I saw the same pid listed for the brick and shd. In this
case the brick was terminated when shd started.
find /var/lib/glusterd/ -name '*pid' -exec tail -v {} \;
==> /var/lib/glusterd/vols/apcfs-default/run/172.27.0.19-data-brick0.pid <==
1468
==> /var/lib/glusterd/glustershd/run/glustershd.pid <==
1468
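
The collision in Note2 can be spotted before glusterd is started. The following
is a minimal sketch; the function name duplicate_pids and the optional directory
argument are hypothetical and exist only so it can be tried on a scratch
directory:

```shell
# Hypothetical helper: print every pid that is recorded in more than one
# pid file below the given directory (default: /var/lib/glusterd).
duplicate_pids() {
    root=${1:-/var/lib/glusterd}
    find "$root" -name '*pid' -type f | while IFS= read -r f; do
        # Take the first line of each pid file and strip whitespace
        p=$(head -n 1 "$f" | tr -d '[:space:]')
        if [ -n "$p" ]; then
            echo "$p"
        fi
    done | sort | uniq -d
}
```

Any pid it prints is shared by at least two pid files, so stopping one service
by pid would also hit the other.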

Actual results:
1. The pid file /var/lib/glusterd/glustershd/run/glustershd.pid remains after
shd is stopped.
2. glusterd kills whatever process has the pid recorded in the stale pid file.

Expected results:
1. The shd pid file should be cleaned up when shd is stopped.
2. glusterd should only kill glusterfs processes.
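
The second expectation can be illustrated in shell, assuming a Linux host where
/proc/<pid>/comm exposes the command name of a process. The helper names
pidfile_matches and safe_kill_from_pidfile are hypothetical; glusterd itself is
written in C and this is not its code, only a sketch of the check:

```shell
# Hypothetical helper: succeed only if the pid file names a live process
# whose command name matches the expected one (default: glusterfs).
pidfile_matches() {
    pm_file=$1
    pm_expected=${2:-glusterfs}
    [ -r "$pm_file" ] || return 1
    pm_pid=$(head -n 1 "$pm_file" | tr -d '[:space:]')
    case $pm_pid in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric content
    esac
    # Linux-specific: /proc/<pid>/comm holds the command name
    [ -r "/proc/$pm_pid/comm" ] || return 1
    [ "$(cat "/proc/$pm_pid/comm")" = "$pm_expected" ]
}

# Signal the process only when the check passes; otherwise leave it alone.
safe_kill_from_pidfile() {
    if pidfile_matches "$1" "${2:-glusterfs}"; then
        kill "$(head -n 1 "$1" | tr -d '[:space:]')"
    else
        echo "skipping $1: pid is stale or not ${2:-glusterfs}" >&2
        return 1
    fi
}
```

With this check, the `sleep infinity` process from the reproduction steps would
be skipped, because its command name is sleep rather than glusterfs. A small
window remains between the check and the kill, so this narrows the problem
rather than eliminating it.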

Additional info:
OS is Ubuntu Trusty

Workaround:

In our automation, when we stop all gluster processes (reboot, upgrade, etc.),
we ensure all processes are stopped and then clean up the pid files with:
find /var/lib/glusterd/ -name '*pid' -delete
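
A more selective variant of this cleanup could remove only the pid files whose
recorded process no longer exists. This is a sketch under assumptions: it
relies on the Linux /proc filesystem, and the function name
remove_stale_pidfiles and its directory argument are hypothetical:

```shell
# Hypothetical helper: delete pid files below a directory (default:
# /var/lib/glusterd) when the pid they record is not a running process.
remove_stale_pidfiles() {
    root=${1:-/var/lib/glusterd}
    find "$root" -name '*pid' -type f | while IFS= read -r f; do
        p=$(head -n 1 "$f" | tr -d '[:space:]')
        case $p in
            ''|*[!0-9]*)
                rm -f -- "$f"      # empty or non-numeric content is unusable
                continue
                ;;
        esac
        # Linux-specific liveness check: /proc/<pid> exists for live processes
        if [ ! -d "/proc/$p" ]; then
            rm -f -- "$f"
        fi
    done
}
```

This does not cover the case where the pid has been reused by an unrelated
process, because such a file still looks live. Deleting every pid file after all
gluster processes are confirmed stopped, as described above, avoids that case.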
