[Bugs] [Bug 1341069] New: [geo-rep]: Monitor crashed with [Errno 3] No such process
bugzilla at redhat.com
Tue May 31 08:11:56 UTC 2016
https://bugzilla.redhat.com/show_bug.cgi?id=1341069
Bug ID: 1341069
Summary: [geo-rep]: Monitor crashed with [Errno 3] No such process
Product: GlusterFS
Version: 3.8.0
Component: geo-replication
Severity: urgent
Assignee: bugs at gluster.org
Reporter: avishwan at redhat.com
CC: bugs at gluster.org, csaba at redhat.com,
rhinduja at redhat.com, rhs-bugs at redhat.com,
storage-qa-internal at redhat.com
Depends On: 1339163, 1339472
Blocks: 1341068
+++ This bug was initially created as a clone of Bug #1339472 +++
+++ This bug was initially created as a clone of Bug #1339163 +++
Description of problem:
=======================
While the Monitor was aborting the worker, it crashed with the following traceback:
[2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
[2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process
Ideally, the monitor process should never go down. If the worker dies, it kills
the agent and the monitor restarts both. If the agent dies, the monitor kills
the worker and restarts both.
In this case, however, the agent died and the monitor crashed while trying to
abort the worker.
The geo-rep session remains in the stopped state until it is restarted.
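One way a supervising process can avoid this kind of crash is to treat ESRCH from os.kill as "the target is already gone" instead of letting the exception propagate. The sketch below is only an illustration of that pattern on a POSIX system; it is not the change from review 14512, and kill_if_alive is a hypothetical helper name, not a function from syncdaemon.

```python
import errno
import os
import signal
import subprocess
import sys


def kill_if_alive(pid, sig=signal.SIGKILL):
    """Send sig to pid and report whether the signal was delivered.

    Hypothetical helper for illustration; not part of syncdaemon.
    Returns True if os.kill succeeded, False if the process no longer
    exists (ESRCH). Any other OSError is re-raised unchanged.
    """
    try:
        os.kill(pid, sig)
    except OSError as e:
        if e.errno == errno.ESRCH:
            # The process has already exited; there is nothing to kill.
            return False
        raise
    return True


if __name__ == "__main__":
    # Stand-in for a worker that exits before the supervisor reacts.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()  # reap the child so its PID no longer names a process
    print(kill_if_alive(child.pid))
```

Only ESRCH is swallowed here; other errors such as EPERM still surface, so a genuine problem with signalling the process is not hidden.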
Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-5.el7rhgs.x86_64
glusterfs-3.7.9-5.el7rhgs.x86_64
How reproducible:
=================
Seen once during an automated regression test suite run.
Steps to Reproduce:
===================
Exact steps are still to be worked out and will be added to the BZ. In general, the scenario would be:
=> Kill the agent and check the monitor logs at the point where the monitor tries to abort the worker.
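The underlying error can be observed outside geo-rep with a small standalone script. This is a sketch under assumptions, not a reproducer for the bug itself: it uses a throwaway child process as a stand-in for the worker, assumes a POSIX system, and signal_reaped_child is a hypothetical name.

```python
import errno
import os
import signal
import subprocess
import sys


def signal_reaped_child():
    """Start a child, wait for it to exit, then signal its stale PID.

    Returns the errno raised by os.kill, or None if no error was raised.
    """
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()  # after reaping, the PID no longer refers to a process
    try:
        os.kill(child.pid, signal.SIGKILL)
    except OSError as e:
        return e.errno
    return None


if __name__ == "__main__":
    err = signal_reaped_child()
    if err == errno.ESRCH:
        print("os.kill raised ESRCH: No such process")
    else:
        print("os.kill returned errno %r" % (err,))
```

Note that a child which has exited but has not yet been reaped still occupies its PID, and signalling it does not fail; the error only appears once the PID has been released.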
--- Additional comment from Vijay Bellur on 2016-05-25 02:47:16 EDT ---
REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully
if worker already died) posted (#2) for review on master by Aravinda VK
(avishwan at redhat.com)
--- Additional comment from Vijay Bellur on 2016-05-27 03:09:55 EDT ---
REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully
if worker already died) posted (#3) for review on master by Aravinda VK
(avishwan at redhat.com)
--- Additional comment from Vijay Bellur on 2016-05-30 03:15:47 EDT ---
REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully
if worker already died) posted (#4) for review on master by Aravinda VK
(avishwan at redhat.com)
--- Additional comment from Vijay Bellur on 2016-05-30 06:24:25 EDT ---
REVIEW: http://review.gluster.org/14512 (geo-rep: Handle Worker kill gracefully
if worker already died) posted (#5) for review on master by Aravinda VK
(avishwan at redhat.com)
--- Additional comment from Vijay Bellur on 2016-05-30 11:12:08 EDT ---
COMMIT: http://review.gluster.org/14512 committed in master by Aravinda VK
(avishwan at redhat.com)
------
commit 4f4a94a35a24d781f3f0e584a8cb59c019e50d6f
Author: Aravinda VK <avishwan at redhat.com>
Date: Tue May 24 14:13:29 2016 +0530
geo-rep: Handle Worker kill gracefully if worker already died
If the Agent dies for any reason, the monitor also tries to kill the Worker.
But if the worker has also died, the kill call raises ESRCH: No such
process.
[2016-05-23 16:49:33.903965] I [monitor(monitor):326:monitor] Monitor: Changelog Agent died, Aborting Worker(/bricks/brick0/master_brick0)
[2016-05-23 16:49:33.904535] E [syncdutils(monitor):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 306, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 393, in wmon
    slave_host, master)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 327, in monitor
    os.kill(cpid, signal.SIGKILL)
OSError: [Errno 3] No such process
With this patch, the monitor gracefully handles the case where the worker has already died.
Change-Id: I3ae5f816a3a197343b64540cf46f5453167fb660
Signed-off-by: Aravinda VK <avishwan at redhat.com>
BUG: 1339472
Reviewed-on: http://review.gluster.org/14512
Smoke: Gluster Build System <jenkins at build.gluster.com>
NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
Reviewed-by: Kotresh HR <khiremat at redhat.com>
CentOS-regression: Gluster Build System <jenkins at build.gluster.com>
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1339163
[Bug 1339163] [geo-rep]: Monitor crashed with [Errno 3] No such process
https://bugzilla.redhat.com/show_bug.cgi?id=1339472
[Bug 1339472] [geo-rep]: Monitor crashed with [Errno 3] No such process
https://bugzilla.redhat.com/show_bug.cgi?id=1341068
[Bug 1341068] [geo-rep]: Monitor crashed with [Errno 3] No such process