[Bugs] [Bug 1214169] New: glusterfsd crashed while rebalance and self-heal were in progress

bugzilla at redhat.com bugzilla at redhat.com
Wed Apr 22 07:06:26 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1214169

            Bug ID: 1214169
           Summary: glusterfsd crashed while rebalance and self-heal were
                    in progress
           Product: GlusterFS
           Version: 3.7.0
         Component: core
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: ssampat at redhat.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com



Description of problem:
-----------------------

On a 6x3 volume, some bricks were brought down while rebalance was in progress.
Since client quorum was enabled, this caused the mount to become read-only. While
rebalance was still in progress, the bricks were brought back up. While checking
self-heal info output, one brick was found to be not connected.

This was not one of the bricks that had been brought down.

The following is seen in the brick logs -

pending frames:
frame : type(0) op(18)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2015-04-22 11:28:30
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7dev
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3d140221c6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3d1403de2f]
/lib64/libc.so.6[0x3d120326a0]
/usr/lib64/glusterfs/3.7dev/xlator/storage/posix.so(posix_getxattr+0xbd3)[0x7f558e324c03]
/usr/lib64/libglusterfs.so.0(default_getxattr+0x7b)[0x3d14027bab]
/usr/lib64/libglusterfs.so.0(default_getxattr+0x7b)[0x3d14027bab]
/usr/lib64/libglusterfs.so.0(default_getxattr+0x7b)[0x3d14027bab]
/usr/lib64/glusterfs/3.7dev/xlator/features/bitrot-stub.so(br_stub_getxattr+0x1e9)[0x7f558d6923a9]
/usr/lib64/glusterfs/3.7dev/xlator/features/access-control.so(posix_acl_getxattr+0x173)[0x7f558d48b9f3]
/usr/lib64/glusterfs/3.7dev/xlator/features/locks.so(pl_getxattr+0x1bb)[0x7f558d275d8b]
/usr/lib64/libglusterfs.so.0(default_getxattr+0x7b)[0x3d14027bab]
/usr/lib64/libglusterfs.so.0(default_getxattr_resume+0x13a)[0x3d1402b38a]
/usr/lib64/libglusterfs.so.0(call_resume+0x80)[0x3d14046470]
/usr/lib64/glusterfs/3.7dev/xlator/performance/io-threads.so(iot_worker+0x158)[0x7f558ce5a388]
/lib64/libpthread.so.0[0x3d124079d1]
/lib64/libc.so.6(clone+0x6d)[0x3d120e88fd]
---------
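
The crash is in posix_getxattr in the posix translator, reached through io-threads,
locks, access-control and bitrot-stub on the getxattr path. If a core file was
dumped, a minimal gdb sketch for further triage (the core path and frame number
below are placeholders, not values from this report):

# gdb /usr/sbin/glusterfsd /path/to/core.<pid>
(gdb) bt full
(gdb) frame <N>
(gdb) info locals

'bt full' prints locals for every frame; selecting the posix_getxattr frame and
running 'info locals' should show which xattr request was being processed when the
segfault hit.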

The following is the volume configuration -

# gluster volume info vol

Volume Name: vol
Type: Distributed-Replicate
Volume ID: 133fe4f3-987c-474d-9904-c28475d4812f
Status: Started
Number of Bricks: 6 x 3 = 18
Transport-type: tcp
Bricks:
Brick1: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick2: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick3: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick4: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick5: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick6: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick7: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick8: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick9: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick10: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick11: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick12: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick13: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick14: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick15: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick16: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick17: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick18: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick5/b1
Options Reconfigured:
cluster.quorum-type: auto
client.event-threads: 4
server.event-threads: 5
features.uss: enable
features.quota: on
cluster.consistent-metadata: on
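
For reference, a volume with this layout and these options can be set up roughly as
follows (a minimal sketch; the brick list is abbreviated, and quota is enabled via
the quota sub-command rather than volume set):

# gluster volume create vol replica 3 \
      vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1 \
      vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1 \
      vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1 \
      <remaining 15 bricks in the order listed above>
# gluster volume start vol
# gluster volume set vol cluster.quorum-type auto
# gluster volume set vol client.event-threads 4
# gluster volume set vol server.event-threads 5
# gluster volume set vol features.uss enable
# gluster volume quota vol enable
# gluster volume set vol cluster.consistent-metadata on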

Note that the client was on a different version of glusterfs than the server.

Version-Release number of selected component (if applicable):
---------------------------------------------------------------

On the server - glusterfs-3.7dev-0.965.git2788ddd.el6.x86_64
On the client - glusterfs-3.7dev-0.1009.git8b987be.el6.x86_64

How reproducible:
------------------

Saw this issue once.

Steps to Reproduce:
--------------------

1. On a 6x3 volume, started a remove-brick operation for one replica set.
2. After data migration for the remove-brick operation completed, stopped the
remove-brick operation.
3. Started a rebalance operation on the volume.
4. While rebalance was in progress, killed two bricks in each of three replica sets.
5. After a while, with rebalance still running, started the volume with force to
bring the killed bricks back up.
6. While monitoring volume heal info output, noticed that one of the bricks was not
connected (see the command sketch below).
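
The sequence above corresponds roughly to the following commands (a hedged sketch;
the brick arguments and PIDs are placeholders, and the exact bricks killed are not
recorded here):

# gluster volume remove-brick vol <brick-A> <brick-B> <brick-C> start
# gluster volume remove-brick vol <brick-A> <brick-B> <brick-C> status
# gluster volume remove-brick vol <brick-A> <brick-B> <brick-C> stop
# gluster volume rebalance vol start
# kill -9 <brick-pid>      (two bricks in each of three replica sets;
                            PIDs taken from 'gluster volume status vol')
# gluster volume start vol force
# gluster volume heal vol info

The heal info check at the end is where the disconnected brick was noticed.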

Actual results:
----------------

Brick process crashed.

Expected results:
------------------

Brick process is not expected to crash.


Additional info:
