[Bugs] [Bug 1500845] New: [geo-rep] master worker crash with interrupted system call
bugzilla at redhat.com
Wed Oct 11 15:12:23 UTC 2017
https://bugzilla.redhat.com/show_bug.cgi?id=1500845
Bug ID: 1500845
Summary: [geo-rep] master worker crash with interrupted system call
Product: GlusterFS
Version: 3.12
Component: geo-replication
Assignee: bugs at gluster.org
Reporter: khiremat at redhat.com
CC: avishwan at redhat.com, bugs at gluster.org,
csaba at redhat.com, rallan at redhat.com,
rhinduja at redhat.com, rhs-bugs at redhat.com,
storage-qa-internal at redhat.com
Depends On: 1477087, 1499393
+++ This bug was initially created as a clone of Bug #1499393 +++
+++ This bug was initially created as a clone of Bug #1477087 +++
Description of problem:
=========================
Ran the automated snapshot and geo-replication test cases and observed a master
worker crash with "Interrupted system call" (EINTR) raised from os.readlink:
[2017-07-31 17:32:22.560633] I
[master(/bricks/brick2/master_brick10):1132:crawl] _GMaster: slave's time:
(1501521813, 0)
[2017-07-31 17:32:22.668236] I
[master(/bricks/brick1/master_brick6):1132:crawl] _GMaster: slave's time:
(1501521812, 0)
[2017-07-31 17:32:23.242929] I [gsyncd(monitor):714:main_i] <top>: Monitor
Status: Paused
[2017-07-31 17:33:24.706393] I [gsyncd(monitor):714:main_i] <top>: Monitor
Status: Started
[2017-07-31 17:33:24.708093] E
[syncdutils(/bricks/brick1/master_brick6):296:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
main_i()
File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in
main_i
local.service_loop(*[r for r in [remote] if r])
File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1582, in
service_loop
g2.crawlwrap()
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 570, in
crawlwrap
self.crawl()
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1143, in
crawl
self.changelogs_batch_process(changes)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1118, in
changelogs_batch_process
self.process(batch)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1001, in
process
self.process_change(change, done, retry)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 894, in
process_change
rl = errno_wrap(os.readlink, [en], [ENOENT], [ESTALE])
File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in
errno_wrap
return call(*arg)
OSError: [Errno 4] Interrupted system call:
'.gfid/6858a52c-4d7d-4c06-889f-3c43e3a91e68/597f69a3%%SAZPTFV05C'
[2017-07-31 17:33:24.714128] I
[syncdutils(/bricks/brick1/master_brick6):237:finalize] <top>: exiting.
[2017-07-31 17:33:24.719995] I
[repce(/bricks/brick1/master_brick6):92:service_loop] RepceServer: terminating
on reaching EOF.
[2017-07-31 17:33:24.720269] I
[syncdutils(/bricks/brick1/master_brick6):237:finalize] <top>: exiting.
[2017-07-31 17:33:24.762686] I [gsyncdstatus(monitor):240:set_worker_status]
GeorepStatus: Worker Status: Faulty
[2017-07-31 17:33:27.880553] I
[master(/bricks/brick2/master_brick10):1132:crawl] _GMaster: slave's time:
(1501522338, 0)
The client log shows the following around the same time:
-----------------------------------------------------
[2017-07-31 16:53:11.296205] I [MSGID: 114035]
[client-handshake.c:201:client_set_lk_version_cbk] 0-master-client-9: Server lk
version = 1
[2017-07-31 16:53:11.319704] I [MSGID: 108031]
[afr-common.c:2264:afr_local_discovery_cbk] 0-master-replicate-3: selecting
local read_child master-client-6
[2017-07-31 16:53:11.319885] I [MSGID: 108031]
[afr-common.c:2264:afr_local_discovery_cbk] 0-master-replicate-1: selecting
local read_child master-client-2
[2017-07-31 16:53:11.320437] I [MSGID: 108031]
[afr-common.c:2264:afr_local_discovery_cbk] 0-master-replicate-5: selecting
local read_child master-client-10
[2017-07-31 17:33:24.751800] I [fuse-bridge.c:5092:fuse_thread_proc] 0-fuse:
initating unmount of /tmp/gsyncd-aux-mount-dW_b8o
[2017-07-31 17:33:24.752358] W [glusterfsd.c:1290:cleanup_and_exit]
(-->/lib64/libpthread.so.0(+0x3777a07aa1) [0x7fc5556f6aa1]
-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fc556b0a845]
-->/usr/sbin/glusterfs(cleanup_and_exit+0x76) [0x7fc556b0a2b6] ) 0-: received
signum (15), shutting down
[2017-07-31 17:33:24.752386] I [fuse-bridge.c:5827:fini] 0-fuse: Unmounting
'/tmp/gsyncd-aux-mount-dW_b8o'.
[2017-07-31 17:33:24.752400] I [fuse-bridge.c:5832:fini] 0-fuse: Closing fuse
connection to '/tmp/gsyncd-aux-mount-dW_b8o'.
[2017-07-31 17:33:39.461354] I [MSGID: 100030] [glusterfsd.c:2431:main]
0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args:
/usr/sbin/glusterfs --aux-gfid-mount --acl
--log-file=/var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.41.228%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Fbricks%2Fbrick1%2Fmaster_brick6.gluster.log
--volfile-server=localhost --volfile-id=master --client-pid=-1
/tmp/gsyncd-aux-mount-dz_HYe)
[2017-07-31 17:33:39.484689] I [MSGID: 101190]
[event-epoll.c:602:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 1
[2017-07-31 17:33:39.500307] I [MSGID: 101173]
[graph.c:269:gf_add_cmdline_options] 0-master-md-cache: adding option
'cache-posix-acl' for volume 'master-md-cache' with value 'true'
Version-Release number of selected component (if applicable):
==================================================================
mainline
How reproducible:
=================
Saw this only once so far.
Steps to Reproduce:
===================
Ran automated snapshot cases with a geo-replication setup
Actual results:
================
The worker crashed
Expected results:
==================
The worker should not crash
--- Additional comment from Worker Ant on 2017-10-06 23:52:18 EDT ---
REVIEW: https://review.gluster.org/18447 (geo-rep: Add EINTR to retry list
while doing readlink) posted (#1) for review on master by Kotresh HR
(khiremat at redhat.com)
--- Additional comment from Worker Ant on 2017-10-10 01:56:08 EDT ---
REVIEW: https://review.gluster.org/18447 (geo-rep: Add EINTR to retry list
while doing readlink) posted (#2) for review on master by Kotresh HR
(khiremat at redhat.com)
--- Additional comment from Worker Ant on 2017-10-11 06:16:56 EDT ---
COMMIT: https://review.gluster.org/18447 committed in master by Aravinda VK
(avishwan at redhat.com)
------
commit 34d52445a9058310d7512c9bcc8c01e709aac1ef
Author: Kotresh HR <khiremat at redhat.com>
Date: Fri Oct 6 23:45:49 2017 -0400
geo-rep: Add EINTR to retry list while doing readlink
Worker occasionally crashed with EINTR on readlink.
This is transient, not persistent. Worker restart
involves re-processing of a few entries in changelogs.
So add EINTR to the retry list to avoid a worker restart.
Change-Id: Iefe641437b5d5be583f079fc2a7a8443bcd19f9d
BUG: 1499393
Signed-off-by: Kotresh HR <khiremat at redhat.com>
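
To illustrate the idea in the commit message above, here is a minimal sketch of a retrying errno wrapper. It is not the actual syncdutils.py source. The argument order (call, args, errnos to tolerate, errnos to retry) is inferred from the errno_wrap call in the traceback, and MAX_RETRIES, RETRY_DELAY and the delay parameter are hypothetical names chosen for this example.

```python
import errno
import time

MAX_RETRIES = 10    # hypothetical cap on attempts, not taken from gsyncd
RETRY_DELAY = 0.25  # hypothetical pause between attempts, in seconds


def errno_wrap(call, arg=(), errnos=(), retry_errnos=(), delay=RETRY_DELAY):
    """Run call(*arg), tolerating or retrying selected OSError errnos.

    errnos:       errors that are expected; the errno is returned instead
                  of being raised (assumed semantics).
    retry_errnos: transient errors; the call is retried up to MAX_RETRIES
                  attempts, after which the last error is re-raised.
    Any other OSError propagates to the caller unchanged.
    """
    tries = 0
    while True:
        try:
            return call(*arg)
        except OSError as ex:
            if ex.errno in errnos:
                return ex.errno
            if ex.errno not in retry_errnos:
                raise
            tries += 1
            if tries >= MAX_RETRIES:
                raise
            time.sleep(delay)
```

Under these assumptions, the failing call from the traceback would presumably become errno_wrap(os.readlink, [en], [ENOENT], [ESTALE, EINTR]), so that a transient EINTR leads to a retry of readlink rather than an unhandled OSError that marks the worker Faulty.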
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1477087
[Bug 1477087] [geo-rep] master worker crash with interrupted system call
https://bugzilla.redhat.com/show_bug.cgi?id=1499393
[Bug 1499393] [geo-rep] master worker crash with interrupted system call
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.