[Bugs] [Bug 1215550] New: glusterfsd crashed after directory was removed from the mount point, while self-heal and rebalance were running on the volume

bugzilla at redhat.com bugzilla at redhat.com
Mon Apr 27 05:43:43 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1215550

            Bug ID: 1215550
           Summary: glusterfsd crashed after directory was removed from
                    the mount point, while self-heal and rebalance were
                    running on the volume
           Product: GlusterFS
           Version: 3.7.0
         Component: core
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: ssampat at redhat.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com



Description of problem:
--------------------------

On a 6x3 volume, some bricks were brought down when rebalance was in progress.
This caused the mount to be read-only (client quorum was enabled). While
rebalance was in progress, the bricks were brought back up. This triggered
self-heal on the volume.

While self-heal was in progress, an attempt was made to remove a directory from
the mount point. After a while, one of the bricks was found to have crashed.

The following is from the log of the brick that crashed -

[2015-04-22 12:31:23.720702] E [posix.c:4433:posix_removexattr] 0-vol-posix:
null gfid for path /linux-3.19.4/include/net/netprio_cgroup.h
frame : type(0) op(0)
frame : type(0) op(20)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(40)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2015-04-22 12:31:23
configuration details:
[2015-04-22 12:31:23.731428] E [posix.c:4433:posix_removexattr] 0-vol-posix:
null gfid for path /linux-3.19.4/include/net/raw.h
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7dev
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3ad30221c6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3ad303de2f]
/lib64/libc.so.6[0x3ad14326a0]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x3ad180c380]
/usr/lib64/glusterfs/3.7dev/xlator/features/quota.so(+0x589b)[0x7f484d94689b]
/usr/lib64/glusterfs/3.7dev/xlator/features/quota.so(quota_fill_inodectx+0x1fa)[0x7f484d94f6aa]
/usr/lib64/glusterfs/3.7dev/xlator/features/quota.so(quota_readdirp_cbk+0x13e)[0x7f484d94fa9e]
/usr/lib64/glusterfs/3.7dev/xlator/features/marker.so(marker_readdirp_cbk+0x13e)[0x7f484db71bbe]
/usr/lib64/libglusterfs.so.0(default_readdirp_cbk+0xc2)[0x3ad302e622]
/usr/lib64/glusterfs/3.7dev/xlator/features/locks.so(pl_readdirp_cbk+0x18b)[0x7f484e5b6cfb]
/usr/lib64/glusterfs/3.7dev/xlator/features/access-control.so(posix_acl_readdirp_cbk+0x27a)[0x7f484e7d0b7a]
/usr/lib64/glusterfs/3.7dev/xlator/features/bitrot-stub.so(br_stub_readdirp_cbk+0x214)[0x7f484e9db304]
/usr/lib64/glusterfs/3.7dev/xlator/storage/posix.so(posix_do_readdir+0x1b8)[0x7f484f871498]
/usr/lib64/glusterfs/3.7dev/xlator/storage/posix.so(posix_readdirp+0x1ee)[0x7f484f872fde]
/usr/lib64/libglusterfs.so.0(default_readdirp+0x83)[0x3ad3027333]
/usr/lib64/libglusterfs.so.0(default_readdirp+0x83)[0x3ad3027333]
/usr/lib64/libglusterfs.so.0(default_readdirp+0x83)[0x3ad3027333]
/usr/lib64/glusterfs/3.7dev/xlator/features/bitrot-stub.so(br_stub_readdirp+0x259)[0x7f484e9d8e29]
/usr/lib64/glusterfs/3.7dev/xlator/features/access-control.so(posix_acl_readdirp+0x19d)[0x7f484e7cd4bd]
/usr/lib64/glusterfs/3.7dev/xlator/features/locks.so(pl_readdirp+0x204)[0x7f484e5b5d94]
/usr/lib64/libglusterfs.so.0(default_readdirp+0x83)[0x3ad3027333]
/usr/lib64/libglusterfs.so.0(default_readdirp_resume+0x142)[0x3ad3029db2]
/usr/lib64/libglusterfs.so.0(call_resume+0x80)[0x3ad3046470]
/usr/lib64/glusterfs/3.7dev/xlator/performance/io-threads.so(iot_worker+0x158)[0x7f484e1a1388]
/lib64/libpthread.so.0[0x3ad18079d1]
/lib64/libc.so.6(clone+0x6d)[0x3ad14e88fd]
---------

The attempt to remove the directory from the mount point failed -

# rm -fr linux-3.19.4
rm: cannot remove `linux-3.19.4/include/crypto': Directory not empty
rm: cannot remove `linux-3.19.4/include/drm': Directory not empty
rm: cannot remove `linux-3.19.4/include/media': Directory not empty
rm: cannot remove `linux-3.19.4/include/net/netfilter': Directory not empty
rm: cannot remove `linux-3.19.4/include/net/bluetooth': Directory not empty

I also see many messages like the following in the brick logs -

<snip>

[2015-04-22 12:31:14.461836] E [posix.c:4433:posix_removexattr] 0-vol-posix:
null gfid for path /linux-3.19.4/include/net/netns
[2015-04-22 12:31:18.132176] E [posix.c:4433:posix_removexattr] 0-vol-posix:
null gfid for path /linux-3.19.4/include/net/caif/caif_layer.h
[2015-04-22 12:31:23.675448] E [posix.c:4433:posix_removexattr] 0-vol-posix:
null gfid for path /linux-3.19.4/include/net/ip6_fib.h
[2015-04-22 12:31:23.691089] E [posix.c:4433:posix_removexattr] 0-vol-posix:
null gfid for path /linux-3.19.4/include/net/ip_fib.h
[2015-04-22 12:31:23.699589] E [posix.c:4433:posix_removexattr] 0-vol-posix:
null gfid for path /linux-3.19.4/include/net/lib80211.h
[2015-04-22 12:31:23.711344] E [posix.c:4433:posix_removexattr] 0-vol-posix:
null gfid for path /linux-3.19.4/include/net/neighbour.h

</snip>

See volume info below -

# gluster volume info vol

Volume Name: vol
Type: Distributed-Replicate
Volume ID: 133fe4f3-987c-474d-9904-c28475d4812f
Status: Started
Number of Bricks: 6 x 3 = 18
Transport-type: tcp
Bricks:
Brick1: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick2: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick3: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick4: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick5: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick6: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick7: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick8: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick9: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick10: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick11: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick12: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick13: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick14: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick15: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick16: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick17: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick18: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick5/b1
Options Reconfigured:
cluster.quorum-type: auto
client.event-threads: 4
server.event-threads: 5
features.uss: enable
features.quota: on
cluster.consistent-metadata: on

Version-Release number of selected component (if applicable):
--------------------------------------------------------------

On the server - glusterfs-3.7dev-0.965.git2788ddd.el6.x86_64
On the client - glusterfs-3.7dev-0.1009.git8b987be.el6.x86_64

How reproducible:
------------------

Saw it once.

Steps to Reproduce:
--------------------

1. On a 6x3 volume, started a remove-brick operation on one replica set.
2. After data migration for the remove-brick operation completed, stopped the
remove-brick operation.
3. Started rebalance operation on the volume.
4. While rebalance was in progress, killed two bricks in each of 3 replica sets.
5. After a while, with rebalance still running, started the volume using force.
6. Was monitoring volume heal info output when I noticed that one of the bricks
was not connected (see BZ#1214169 for details).
7. While self-heal was in progress, tried to remove a directory from the mount
point -

# rm -fr linux-3.19.4

8. After a while, a brick was found to have crashed (the brick showed as
disconnected in heal info output).

Actual results:
----------------

Brick process crashed.

Expected results:
------------------

The brick process is not expected to crash.

Additional info:
