[Bugs] [Bug 1533269] New: Random GlusterFSD process dies during rebalance

bugzilla at redhat.com bugzilla at redhat.com
Wed Jan 10 22:38:28 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1533269

            Bug ID: 1533269
           Summary: Random GlusterFSD process dies during rebalance
           Product: GlusterFS
           Version: 3.12
         Component: glusterd
          Severity: medium
          Assignee: bugs at gluster.org
          Reporter: jmackey at getcruise.com
                CC: bugs at gluster.org



Description of problem:

During a rebalance of a 252-brick volume, while the rebalance is still scanning
the initial directories (within 5-10 minutes), a seemingly random peer brick
process dies, which stops the rebalance processes. The brick logs show healthy
connection and disconnection messages up until the failure, at which point the
brick process prints a stack trace:

pending frames:
frame : type(0) op(36)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-01-10 21:13:21
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.4
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xaa)[0x7f7000635a5a]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x2e7)[0x7f700063f737]
/lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7f6fffa284b0]
/lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4)[0x7f6fffdc6d44]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_rename_key+0x66)[0x7f7000630866]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/selinux.so(+0x1f15)[0x7f6ff87a9f15]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/marker.so(+0x11d77)[0x7f6ff8595d77]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/quota.so(+0xe02f)[0x7f6ff3de602f]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/debug/io-stats.so(+0x72d6)[0x7f6ff3bb12d6]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0x2e3be)[0x7f6ff37763be]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbd19)[0x7f6ff3753d19]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdb5)[0x7f6ff3753db5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc75c)[0x7f6ff375475c]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdfe)[0x7f6ff3753dfe]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc5fe)[0x7f6ff37545fe]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc712)[0x7f6ff3754712]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdde)[0x7f6ff3753dde]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc804)[0x7f6ff3754804]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0x276ce)[0x7f6ff376f6ce]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpcsvc_request_handler+0x96)[0x7f70003feca6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f6fffdc46ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6fffafa3dd]
---------

Anywhere from one to five brick processes, spread across various hosts, die at
the same time.
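
For anyone triaging this: the crash appears to be inside dict_rename_key()
called from the selinux xlator's fsetxattr path (see the frames above). The
offsets in the trace can be resolved against the installed shared objects with
addr2line; the commands below are only a sketch and assume debug symbols for
the 3.12.4 packages are available (e.g. the distribution's -dbg/-dbgsym
packages), with /path/to/core as a placeholder.

# resolve the selinux xlator frame (offset 0x1f15 taken from the trace above)
addr2line -f -C -e /usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/selinux.so 0x1f15
# resolve the marker xlator frame (offset 0x11d77 from the trace)
addr2line -f -C -e /usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/marker.so 0x11d77
# or, if core dumps are enabled for the brick processes, open one in gdb
gdb /usr/sbin/glusterfsd /path/to/core    # then: (gdb) thread apply all bt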

Version-Release number of selected component (if applicable):

3.12.4


How reproducible:

This happens consistently within the first 10 minutes of a rebalance.

Steps to Reproduce:
- We had an existing Gluster volume with about 2 PB of data in it. Since our
existing GlusterFS configuration (3.7.20) was quite old, we decided to bring
down the cluster and rebuild it fresh on top of the existing data. All
GlusterFS 3.7.20 libraries were purged, the .glusterfs directory was deleted
from each brick, and glusterd 3.12.4 was installed. All 252 bricks were
re-added to the cluster and a fix-layout completed successfully. However, when
a full rebalance is initiated, peer brick processes eventually crash; the
approximate CLI sequence is sketched below.
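
For reference, the reproduction boils down to the standard rebalance CLI
sequence below (a sketch only; <VOLNAME> is a placeholder):

gluster volume rebalance <VOLNAME> fix-layout start   # completes successfully
gluster volume rebalance <VOLNAME> start              # bricks begin crashing within 5-10 minutes
gluster volume rebalance <VOLNAME> status             # check progress; stops once bricks die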

Actual results:

Within 5-10 minutes of starting the full rebalance, one to five brick
processes crash with SIGSEGV (backtrace above) and the rebalance stops.

Expected results:

The full rebalance completes without any brick processes crashing.

Additional info:
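
For completeness, the crashed brick processes can be brought back online
between attempts without a full volume restart, using the standard CLI
(sketch; <VOLNAME> is a placeholder):

gluster volume status <VOLNAME>        # crashed bricks show as offline (Online column = N)
gluster volume start <VOLNAME> force   # respawns only the brick processes that are not running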
