[Bugs] [Bug 1399090] New: [geo-rep]: Worker crashes seen while renaming directories in loop
bugzilla at redhat.com
Mon Nov 28 09:34:59 UTC 2016
https://bugzilla.redhat.com/show_bug.cgi?id=1399090
Bug ID: 1399090
Summary: [geo-rep]: Worker crashes seen while renaming directories in loop
Product: GlusterFS
Version: 3.8
Component: geo-replication
Severity: high
Assignee: bugs at gluster.org
Reporter: avishwan at redhat.com
CC: bugs at gluster.org, csaba at redhat.com, rhinduja at redhat.com, rhs-bugs at redhat.com, storage-qa-internal at redhat.com
Depends On: 1385589, 1396062
+++ This bug was initially created as a clone of Bug #1396062 +++
+++ This bug was initially created as a clone of Bug #1385589 +++
Description of problem:
=======================
While testing the creation and renaming of directories in a loop, multiple
crashes were found, as follows:
[root at dhcp37-177 Master]# grep -ri "OSError: " *
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/957'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave2.log:OSError: [Errno 2] No such file or directory: '.gfid/00000000-0000-0000-0000-000000000001/nfs_dir.426'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave2.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/22'
[root at dhcp37-177 Master]#
Master:
=======
Crash 1: [Errno 2] No such file or directory:
=============================================
[2016-10-16 17:35:06.867371] E [syncdutils(/rhs/brick2/b4):289:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 203, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 743, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1532, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
Crash 2: [Errno 21] Is a directory
==================================
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 203, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 743, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1532, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
These crashes are propagated from the slave as follows:
=======================================================
[2016-10-16 17:31:06.800229] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/957'
[2016-10-16 17:31:06.825957] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-10-16 17:31:06.826287] I [syncdutils(slave):230:finalize] <top>: exiting.
[2016-10-16 17:31:18.37847] I [gsyncd(slave):733:main_i] <top>: syncing: gluster://localhost:Slave1
[2016-10-16 17:31:23.391367] I [resource(slave):914:service_loop] GLUSTER: slave listening
[2016-10-16 17:35:06.864521] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
[2016-10-16 17:35:06.884804] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-10-16 17:35:06.885364] I [syncdutils(slave):230:finalize] <top>: exiting.
[2016-10-16 17:35:17.967597] I [gsyncd(slave):733:main_i] <top>: syncing: gluster://localhost:Slave1
[2016-10-16 17:35:23.303258] I [resource(slave):914:service_loop] GLUSTER: slave listening
[2016-10-16 17:46:21.666467] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
[2016-10-16 17:46:21.687004] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
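For context on why the same OSError appears in both the master and slave logs: the RepCE server catches the exception on the slave side and ships it back, and the master-side proxy re-raises it (the "raise res" frame in the master tracebacks above). A minimal sketch of that pattern (function names illustrative; the real repce.py serializes the exception over the worker's connection):

    # Sketch of the propagate-and-reraise pattern used by RepCE.
    def server_call(obj, meth, *args):
        try:
            return (True, getattr(obj, meth)(*args))
        except Exception as e:
            return (False, e)  # send the exception back instead of raising here

    def client_call(obj, meth, *args):
        ok, res = server_call(obj, meth, *args)  # stand-in for the RPC round trip
        if not ok:
            raise res  # mirrors the "raise res" in the master traceback
        return res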
Version-Release number of selected component (if applicable):
=============================================================
glusterfs-server-3.8.4-2.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-2.el7rhgs.x86_64
How reproducible:
=================
Always
Steps to Reproduce:
===================
This was seen on a non-root fanout setup, but it should also reproduce on a
normal setup. The exact steps carried out were:
1. Create Master (2 nodes) and Slave Cluster (4 nodes)
2. Create and Start Master and 2 Slave Volumes (Each 2x2)
3. Create mount-broker geo-rep session between master and 2 slave volumes
4. Mount the Master and Slave Volume (NFS and Fuse)
5. Create directories on the master and rename them in a loop:
for i in {1..1999}; do mkdir $i ; sleep 1 ; mv $i rs.$i ; done
for i in {1..1000}; do mv dir.$i rename_dir.$i; done
for i in {1..500}; do mkdir h.$i ; mv h.$i rsh.$i ; done
Actual results:
===============
Worker crashes seen with Errno 2 (ENOENT) and Errno 21 (EISDIR).
Master:
=======
[root at dhcp37-58 ~]# gluster v info
Volume Name: Master
Type: Distributed-Replicate
Volume ID: a4dc4c5c-95d7-4c71-ad52-3bbe70fc7240
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.58:/rhs/brick1/b1
Brick2: 10.70.37.177:/rhs/brick1/b2
Brick3: 10.70.37.58:/rhs/brick2/b3
Brick4: 10.70.37.177:/rhs/brick2/b4
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: off
cluster.enable-shared-storage: enable
Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: cb7be148-8b85-43a2-837b-bb9d7de41a20
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.177:/var/lib/glusterd/ss_brick
Brick2: dhcp37-58.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.enable-shared-storage: enable
[root at dhcp37-58 ~]#
Slave:
======
[root at dhcp37-214 ~]# gluster v info
Volume Name: Slave1
Type: Distributed-Replicate
Volume ID: 928051ec-0177-4d13-b1cc-71d7783bfd95
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.214:/rhs/brick1/b1
Brick2: 10.70.37.62:/rhs/brick1/b2
Brick3: 10.70.37.214:/rhs/brick2/b3
Brick4: 10.70.37.62:/rhs/brick2/b4
Options Reconfigured:
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
cluster.enable-shared-storage: enable
Volume Name: Slave2
Type: Distributed-Replicate
Volume ID: 72c1006b-135f-4641-b2a1-a10a5a1ac12b
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.178:/rhs/brick1/b1
Brick2: 10.70.37.59:/rhs/brick1/b2
Brick3: 10.70.37.178:/rhs/brick2/b3
Brick4: 10.70.37.59:/rhs/brick2/b4
Options Reconfigured:
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
cluster.enable-shared-storage: enable
[root at dhcp37-214 ~]#
--- Additional comment from Aravinda VK on 2016-10-17 07:55:22 EDT ---
Crash 1: [Errno 2] No such file or directory:
This looks like two workers attempting the unlink at the same time (as part
of a rename, during changelog reprocessing).
Solution: handle ENOENT and ESTALE errors during unlink, as sketched below.
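To illustrate the race, a minimal standalone sketch (the path, the replay_rename name, and the threading setup are illustrative, not gsyncd code): when two workers replay the same RENAME, exactly one os.unlink() succeeds and the other hits ENOENT, matching the slave logs above.

    # Illustrative only: simulate two workers replaying the same RENAME.
    # Exactly one os.unlink() succeeds; the other raises OSError(ENOENT).
    import errno
    import os
    import threading

    def replay_rename(path):
        try:
            os.unlink(path)  # stands in for the unlink done while replaying RENAME
        except OSError as e:
            print("worker hit:", errno.errorcode[e.errno])  # -> ENOENT

    open("/tmp/geo_rep_entry", "w").close()
    workers = [threading.Thread(target=replay_rename,
                                args=("/tmp/geo_rep_entry",)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()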
Crash 2: [Errno 21] Is a directory:
"Is a Directory" issue is fixed in upstream
http://review.gluster.org/15132
Related BZ https://bugzilla.redhat.com/show_bug.cgi?id=1365694#c3
--- Additional comment from Worker Ant on 2016-11-17 06:41:09 EST ---
REVIEW: http://review.gluster.org/15868 (geo-rep: Handle ENOENT during unlink)
posted (#1) for review on master by Aravinda VK (avishwan at redhat.com)
--- Additional comment from Worker Ant on 2016-11-22 14:33:41 EST ---
COMMIT: http://review.gluster.org/15868 committed in master by Vijay Bellur
(vbellur at redhat.com)
------
commit ecd6da0a754f21909dbbd8189228f5a27a15df3e
Author: Aravinda VK <avishwan at redhat.com>
Date: Thu Nov 17 17:07:36 2016 +0530
geo-rep: Handle ENOENT during unlink
Do not raise traceback if a file/dir not exists during
unlink or rmdir
BUG: 1396062
Change-Id: Idd43ca1fa6ae6056c3cd493f0e2f151880a3968c
Signed-off-by: Aravinda VK <avishwan at redhat.com>
Reviewed-on: http://review.gluster.org/15868
Smoke: Gluster Build System <jenkins at build.gluster.org>
NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
Reviewed-by: Vijay Bellur <vbellur at redhat.com>
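A minimal sketch of the handling this commit describes (remove_entry and SAFE_ERRNOS are illustrative names assumed here, not the actual gsyncd helpers): removal failures with ENOENT or ESTALE are treated as success, and any other error is still raised.

    # Sketch, following the commit's approach: treat "already gone" as
    # success when replaying entry removal, since another worker (or an
    # earlier replay of the same changelog) may have removed it first.
    import errno
    import os

    SAFE_ERRNOS = (errno.ENOENT, errno.ESTALE)

    def remove_entry(entry, isdir=False):
        op = os.rmdir if isdir else os.unlink
        try:
            op(entry)
        except OSError as e:
            if e.errno not in SAFE_ERRNOS:
                raise  # any other errno is still a real failure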
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1385589
[Bug 1385589] [geo-rep]: Worker crashes seen while renaming directories in loop
https://bugzilla.redhat.com/show_bug.cgi?id=1396062
[Bug 1396062] [geo-rep]: Worker crashes seen while renaming directories in loop