[Bugs] [Bug 1399092] New: [geo-rep]: Worker crashes seen while renaming directories in loop

bugzilla at redhat.com
Mon Nov 28 09:35:14 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1399092

            Bug ID: 1399092
           Summary: [geo-rep]: Worker crashes seen while renaming
                    directories in loop
           Product: GlusterFS
           Version: 3.9
         Component: geo-replication
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: avishwan at redhat.com
                CC: bugs at gluster.org, csaba at redhat.com,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1385589, 1396062
            Blocks: 1399090



+++ This bug was initially created as a clone of Bug #1396062 +++

+++ This bug was initially created as a clone of Bug #1385589 +++

Description of problem:
=======================

While testing the creation and renaming of directories in a loop, multiple
crashes were found, as follows:

[root at dhcp37-177 Master]# grep -ri "OSError: " *
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError:
[Errno 2] No such file or directory:
'.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/957'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError:
[Errno 2] No such file or directory:
'.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError:
[Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave2.log:OSError:
[Errno 2] No such file or directory:
'.gfid/00000000-0000-0000-0000-000000000001/nfs_dir.426'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave2.log:OSError:
[Errno 2] No such file or directory:
'.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/22'
[root at dhcp37-177 Master]# 

Master:
=======
Crash 1: [Errno 2] No such file or directory:
=============================================

[2016-10-16 17:35:06.867371] E
[syncdutils(/rhs/brick2/b4):289:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 203, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 743, in
main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1532, in
service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in
crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in
crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in
changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in
process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in
process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in
__call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in
__call__
    raise res
OSError: [Errno 2] No such file or directory:
'.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'


Crash 2: [Errno 21] Is a directory
==================================

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 203, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 743, in
main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1532, in
service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in
crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in
crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in
changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in
process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in
process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in
__call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in
__call__
    raise res
OSError: [Errno 21] Is a directory:
'.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'


These crashes are propagated from the slave as:
===========================================

[2016-10-16 17:31:06.800229] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in
entry_ops
    os.unlink(entry)
OSError: [Errno 2] No such file or directory:
'.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/957'
[2016-10-16 17:31:06.825957] I [repce(slave):92:service_loop] RepceServer:
terminating on reaching EOF.
[2016-10-16 17:31:06.826287] I [syncdutils(slave):230:finalize] <top>: exiting.
[2016-10-16 17:31:18.37847] I [gsyncd(slave):733:main_i] <top>: syncing:
gluster://localhost:Slave1
[2016-10-16 17:31:23.391367] I [resource(slave):914:service_loop] GLUSTER:
slave listening
[2016-10-16 17:35:06.864521] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in
entry_ops
    os.unlink(entry)
OSError: [Errno 2] No such file or directory:
'.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
[2016-10-16 17:35:06.884804] I [repce(slave):92:service_loop] RepceServer:
terminating on reaching EOF.
[2016-10-16 17:35:06.885364] I [syncdutils(slave):230:finalize] <top>: exiting.
[2016-10-16 17:35:17.967597] I [gsyncd(slave):733:main_i] <top>: syncing:
gluster://localhost:Slave1
[2016-10-16 17:35:23.303258] I [resource(slave):914:service_loop] GLUSTER:
slave listening
[2016-10-16 17:46:21.666467] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in
entry_ops
    os.unlink(entry)
OSError: [Errno 21] Is a directory:
'.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
[2016-10-16 17:46:21.687004] I [repce(slave):92:service_loop] RepceServer:
terminating on


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-server-3.8.4-2.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-2.el7rhgs.x86_64


How reproducible:
=================
Always


Steps to Reproduce:
===================
Seen on a non-root fanout setup, but it should also be reproducible on a
normal setup. The exact steps carried out were:

1. Create Master (2 nodes) and Slave Cluster (4 nodes)
2. Create and Start Master and 2 Slave Volumes (Each 2x2)
3. Create mount-broker geo-rep session between master and 2 slave volumes
4. Mount the Master and Slave Volume (NFS and Fuse)
5. Create directories on the master and rename them in a loop:
for i in {1..1999}; do mkdir $i ; sleep 1 ; mv $i rs.$i ; done
for i in {1..1000}; do mv dir.$i rename_dir.$i; done
for i in {1..500}; do mkdir h.$i ; mv h.$i rsh.$i ; done

Actual results:
===============

Worker crashes seen with Errno 2 and Errno 21.


Master:
=======

[root at dhcp37-58 ~]# gluster v info 

Volume Name: Master
Type: Distributed-Replicate
Volume ID: a4dc4c5c-95d7-4c71-ad52-3bbe70fc7240
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.58:/rhs/brick1/b1
Brick2: 10.70.37.177:/rhs/brick1/b2
Brick3: 10.70.37.58:/rhs/brick2/b3
Brick4: 10.70.37.177:/rhs/brick2/b4
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: off
cluster.enable-shared-storage: enable

Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: cb7be148-8b85-43a2-837b-bb9d7de41a20
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.177:/var/lib/glusterd/ss_brick
Brick2: dhcp37-58.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.enable-shared-storage: enable
[root at dhcp37-58 ~]# 


Slave:
======

[root at dhcp37-214 ~]# gluster v info 

Volume Name: Slave1
Type: Distributed-Replicate
Volume ID: 928051ec-0177-4d13-b1cc-71d7783bfd95
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.214:/rhs/brick1/b1
Brick2: 10.70.37.62:/rhs/brick1/b2
Brick3: 10.70.37.214:/rhs/brick2/b3
Brick4: 10.70.37.62:/rhs/brick2/b4
Options Reconfigured:
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
cluster.enable-shared-storage: enable

Volume Name: Slave2
Type: Distributed-Replicate
Volume ID: 72c1006b-135f-4641-b2a1-a10a5a1ac12b
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.178:/rhs/brick1/b1
Brick2: 10.70.37.59:/rhs/brick1/b2
Brick3: 10.70.37.178:/rhs/brick2/b3
Brick4: 10.70.37.59:/rhs/brick2/b4
Options Reconfigured:
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
cluster.enable-shared-storage: enable
[root at dhcp37-214 ~]#

--- Additional comment from Aravinda VK on 2016-10-17 07:55:22 EDT ---

Crash 1: [Errno 2] No such file or directory:
This looks like two workers attempting the unlink at the same time (as part of
a rename, during changelog reprocessing).

Solution: Handle the ENOENT and ESTALE errors during unlink.
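
For illustration, a minimal sketch of the kind of handling proposed. The
safe_unlink() helper below is hypothetical and is not the actual change made
in resource.py; entry_ops() would call something like it in place of the bare
os.unlink(entry) seen in the tracebacks above.

    import errno
    import os

    def safe_unlink(path):
        # Unlink path, but treat ENOENT/ESTALE as a no-op so that a
        # duplicate or reprocessed entry op does not crash the worker.
        try:
            os.unlink(path)
        except OSError as e:
            if e.errno in (errno.ENOENT, errno.ESTALE):
                pass  # already removed by another worker, or stale handle
            else:
                raise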


Crash 2: [Errno 21] Is a directory:
The "Is a directory" issue is already fixed upstream:
http://review.gluster.org/15132

Related BZ https://bugzilla.redhat.com/show_bug.cgi?id=1365694#c3
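
For the EISDIR case, one possible approach is to fall back to rmdir when the
entry turns out to be a directory. This is only an illustrative sketch and is
not necessarily what review 15132 implements; remove_entry() is a hypothetical
helper.

    import errno
    import os

    def remove_entry(path):
        # Remove path whether it is a file or a directory, and ignore
        # entries that are already gone.
        try:
            os.unlink(path)
        except OSError as e:
            if e.errno == errno.EISDIR:
                os.rmdir(path)  # target is a directory, remove it as one
            elif e.errno in (errno.ENOENT, errno.ESTALE):
                pass            # already removed
            else:
                raise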

--- Additional comment from Worker Ant on 2016-11-17 06:41:09 EST ---

REVIEW: http://review.gluster.org/15868 (geo-rep: Handle ENOENT during unlink)
posted (#1) for review on master by Aravinda VK (avishwan at redhat.com)

--- Additional comment from Worker Ant on 2016-11-22 14:33:41 EST ---

COMMIT: http://review.gluster.org/15868 committed in master by Vijay Bellur
(vbellur at redhat.com) 
------
commit ecd6da0a754f21909dbbd8189228f5a27a15df3e
Author: Aravinda VK <avishwan at redhat.com>
Date:   Thu Nov 17 17:07:36 2016 +0530

    geo-rep: Handle ENOENT during unlink

    Do not raise traceback if a file/dir not exists during
    unlink or rmdir

    BUG: 1396062
    Change-Id: Idd43ca1fa6ae6056c3cd493f0e2f151880a3968c
    Signed-off-by: Aravinda VK <avishwan at redhat.com>
    Reviewed-on: http://review.gluster.org/15868
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Vijay Bellur <vbellur at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1385589
[Bug 1385589] [geo-rep]: Worker crashes seen while renaming directories in
loop
https://bugzilla.redhat.com/show_bug.cgi?id=1396062
[Bug 1396062] [geo-rep]: Worker crashes seen while renaming directories in
loop
https://bugzilla.redhat.com/show_bug.cgi?id=1399090
[Bug 1399090] [geo-rep]: Worker crashes seen while renaming directories in
loop
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

