[Bugs] [Bug 1341069] New: [geo-rep]: Monitor crashed with [Errno 3] No such process

Tue May 31 08:11:56 UTC 2016

https://bugzilla.redhat.com/show_bug.cgi?id=1341069

            Bug ID: 1341069
           Summary: [geo-rep]: Monitor crashed with [Errno 3] No such
                    process
           Product: GlusterFS
           Version: 3.8.0
         Component: geo-replication
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: avishwan at redhat.com
                CC: bugs at gluster.org, csaba at redhat.com,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1339163, 1339472
            Blocks: 1341068

+++ This bug was initially created as a clone of Bug #1339472 +++

+++ This bug was initially created as a clone of Bug #1339163 +++

Description of problem:
=======================

While the monitor was aborting the worker, it crashed with the following
traceback:

[2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
[2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process

Ideally, the monitor process should never go down. If the worker dies, the
agent is killed and the monitor restarts both. If the agent dies, the monitor
kills the worker and restarts both.

In this case, however, the agent died and the monitor itself crashed while
trying to abort the worker.

The geo-rep session then remains in the stopped state until it is restarted.
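
The error in the traceback can be reproduced outside geo-rep. The sketch below
is only an illustration under assumptions, not code from syncdaemon: it starts
a throwaway child process, waits for it to exit, and then signals the stale
PID. The helper name signal_reaped_child is hypothetical, and signal 0 is used
instead of SIGKILL so that nothing is actually delivered if the PID were ever
reused; on a POSIX system the existence check fails in the same way.

```python
import errno
import os
import subprocess
import sys


def signal_reaped_child():
    """Start a short-lived child, reap it, then signal its stale PID.

    Returns the errno of the OSError raised by os.kill, or None if the
    call unexpectedly succeeded. Hypothetical helper for illustration.
    """
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()  # child has exited and been reaped; its PID is now stale
    try:
        # Signal 0 performs the "does this process exist" check only.
        os.kill(child.pid, 0)
    except OSError as e:
        return e.errno
    return None


if __name__ == "__main__":
    err = signal_reaped_child()
    print("os.kill failed with ESRCH:", err == errno.ESRCH)
```

An unhandled exception of this kind in the monitor thread matches the crash
reported above.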

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-5.el7rhgs.x86_64
glusterfs-3.7.9-5.el7rhgs.x86_64

How reproducible:
=================
Seen once during a run of the automated regression test suite.

Steps to Reproduce:
===================
Exact steps are still being worked out and will be added to this BZ. In
general, the scenario would be:

=> Kill the agent and watch the monitor logs, where the monitor tries to abort
the worker.

--- Additional comment from Vijay Bellur on 2016-05-25 02:47:16 EDT ---

REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully
if worker already died) posted (#2) for review on master by Aravinda VK
(avishwan at redhat.com)

--- Additional comment from Vijay Bellur on 2016-05-27 03:09:55 EDT ---

REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully
if worker already died) posted (#3) for review on master by Aravinda VK
(avishwan at redhat.com)

--- Additional comment from Vijay Bellur on 2016-05-30 03:15:47 EDT ---

REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully
if worker already died) posted (#4) for review on master by Aravinda VK
(avishwan at redhat.com)

--- Additional comment from Vijay Bellur on 2016-05-30 06:24:25 EDT ---

REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully
if worker already died) posted (#5) for review on master by Aravinda VK
(avishwan at redhat.com)

--- Additional comment from Vijay Bellur on 2016-05-30 11:12:08 EDT ---

COMMIT: http://review.gluster.org/14512 committed in master by Aravinda VK
(avishwan at redhat.com) 
------
commit 4f4a94a35a24d781f3f0e584a8cb59c019e50d6f
Author: Aravinda VK <avishwan at redhat.com>
Date:   Tue May 24 14:13:29 2016 +0530

    geo-rep: Handle Worker kill gracefully if worker already died

    If Agent dies for any reason, monitor tries to kill Worker also. But
    if worker is also died then kill command raises error ESRCH: No such
    process.

    [2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor:
        Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
    [2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception]
        <top>: FAIL:
    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
        tf(*aa)
      File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
        slave_host, master)
      File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
        os.kill(cpid, signal.SIGKILL)
    OSError: [Errno 3] No such process

    With this patch, monitor will gracefully handle if worker is already died.

    Change-Id: I3ae5f816a3a197343b64540cf46f5453167fb660
    Signed-off-by: Aravinda VK <avishwan at redhat.com>
    BUG: 1339472
    Reviewed-on: http://review.gluster.org/14512
    Smoke: Gluster Build System <jenkins at build.gluster.com>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    Reviewed-by: Kotresh HR <khiremat at redhat.com>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.com>
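
The behaviour described in the commit message (treat "worker already died" as
a non-error when killing it) could be sketched roughly as follows. This is not
the actual patch from review 14512: kill_if_alive and its kill parameter are
hypothetical names introduced here for illustration, and the real change in
monitor.py may be structured differently.

```python
import errno
import os
import signal


def kill_if_alive(pid, sig=signal.SIGKILL, kill=os.kill):
    """Send sig to pid; return False instead of raising if pid is gone.

    Hypothetical helper, not a function from syncdaemon. The kill
    parameter exists only so the behaviour can be exercised without
    signalling real processes.
    """
    try:
        kill(pid, sig)
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False  # process already died; nothing left to kill
        raise  # EPERM and any other error still propagate
    return True


if __name__ == "__main__":
    def already_gone(pid, sig):
        # Stand-in for os.kill on a PID that no longer exists.
        raise OSError(errno.ESRCH, "No such process")

    print(kill_if_alive(12345, kill=already_gone))
```

Only ESRCH is swallowed; other failures are re-raised so that unrelated
problems, such as a permission error, are not hidden.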

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1339163
[Bug 1339163] [geo-rep]: Monitor crashed with [Errno 3] No such process
https://bugzilla.redhat.com/show_bug.cgi?id=1339472
[Bug 1339472] [geo-rep]: Monitor crashed with [Errno 3] No such process
https://bugzilla.redhat.com/show_bug.cgi?id=1341068
[Bug 1341068] [geo-rep]: Monitor crashed with [Errno 3] No such process