[Bugs] [Bug 1536294] New: Random GlusterFSD process dies during rebalance

bugzilla at redhat.com bugzilla at redhat.com
Fri Jan 19 04:27:04 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1536294

            Bug ID: 1536294
           Summary: Random GlusterFSD process dies during rebalance
           Product: GlusterFS
           Version: 3.13
         Component: core
          Severity: medium
          Assignee: bugs at gluster.org
          Reporter: jthottan at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    jmackey at getcruise.com, jthottan at redhat.com



+++ This bug was initially created as a clone of Bug #1533269 +++

Description of problem:

During a rebalance of a 252-brick volume, while the rebalance is still scanning
through the initial directories (within 5-10 minutes), a seemingly random peer
brick process dies, which stops the rebalance processes. The brick logs contain
healthy connection and disconnection messages up until the failure, where the
brick process throws a stack trace:

pending frames:
frame : type(0) op(36)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-01-10 21:13:21
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.4
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xaa)[0x7f7000635a5a]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x2e7)[0x7f700063f737]
/lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7f6fffa284b0]
/lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4)[0x7f6fffdc6d44]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_rename_key+0x66)[0x7f7000630866]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/selinux.so(+0x1f15)[0x7f6ff87a9f15]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/marker.so(+0x11d77)[0x7f6ff8595d77]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/quota.so(+0xe02f)[0x7f6ff3de602f]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/debug/io-stats.so(+0x72d6)[0x7f6ff3bb12d6]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0x2e3be)[0x7f6ff37763be]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbd19)[0x7f6ff3753d19]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdb5)[0x7f6ff3753db5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc75c)[0x7f6ff375475c]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdfe)[0x7f6ff3753dfe]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc5fe)[0x7f6ff37545fe]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc712)[0x7f6ff3754712]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdde)[0x7f6ff3753dde]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc804)[0x7f6ff3754804]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0x276ce)[0x7f6ff376f6ce]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpcsvc_request_handler+0x96)[0x7f70003feca6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f6fffdc46ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6fffafa3dd]
---------

Anywhere from 1 to 5 brick processes on various hosts will all die at the same
time.
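For context, the top user frames (pthread_mutex_lock called from
dict_rename_key) are consistent with a NULL dict being handed to
dict_rename_key(), which takes the dict's lock before doing anything else. The
stand-alone sketch below reproduces that failure mode in isolation; dict_t, the
lock field, and the xattr key strings here are simplified stand-ins, not the
real GlusterFS definitions.

#include <pthread.h>
#include <stddef.h>

typedef struct {
        pthread_mutex_t lock;            /* simplified stand-in for dict_t */
} dict_t;

static int
toy_dict_rename_key (dict_t *this, const char *key, const char *replace_key)
{
        (void) key;
        (void) replace_key;
        /* no NULL check: with this == NULL the next call dereferences a
         * near-NULL address inside pthread_mutex_lock(), i.e. signal 11 */
        pthread_mutex_lock (&this->lock);
        /* ... key rename elided ... */
        pthread_mutex_unlock (&this->lock);
        return 0;
}

int
main (void)
{
        dict_t *xattr = NULL;            /* the dict the caller never filled in */
        return toy_dict_rename_key (xattr, "security.selinux",
                                     "trusted.glusterfs.selinux");
}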

Version-Release number of selected component (if applicable):

3.12.4


How reproducible:

This happens consistently within the first 10 minutes of a rebalance.

Steps to Reproduce:
- We had an existing gluster volume with about 2PB of data in it. Since our
existing gluster configs (3.7.20) were pretty old, we decided to bring down the
cluster and rebuild it fresh with the existing data. All gluster 3.7.20
libraries were purged, the .glusterfs directory was deleted from each brick,
and glusterd 3.12.4 was installed. All 252 bricks were re-added to the cluster
and a fix-layout was performed successfully. However, when a full rebalance is
initiated, peer brick processes eventually crash.

Actual results:


Expected results:


Additional info:

--- Additional comment from Atin Mukherjee on 2018-01-11 03:46:53 EST ---

Jiffin, seems to be crashing from selinux.c. Can you please check?

--- Additional comment from Jiffin on 2018-01-11 04:03:02 EST ---

Sure I will take a look

--- Additional comment from Worker Ant on 2018-01-17 23:02:01 EST ---

REVIEW: https://review.gluster.org/19220 (selinux-xlator : validate dict before
calling dict_rename_key()) posted (#1) for review on master by jiffin tony
Thottan

--- Additional comment from Jiffin on 2018-01-17 23:03:17 EST ---

From the core it looks like dict = NULL was passed to the fops handled by the
selinux xlator, which caused this error. A patch has been posted upstream for
review: https://review.gluster.org/19220
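
The shape of that change is essentially a guard clause in front of the rename.
Below is a hedged, self-contained sketch of the idea; the stub and function
names (dict_rename_key_stub, selinux_rename_xattr_sketch) and the xattr strings
are illustrative, not copied from the patch.

#include <stdio.h>
#include <stddef.h>

typedef struct dict dict_t;              /* opaque stand-in for the real dict_t */

/* stand-in for libglusterfs dict_rename_key(); the real one takes the
 * dict's lock straight away, so handing it a NULL dict is fatal */
static int
dict_rename_key_stub (dict_t *this, const char *key, const char *replace_key)
{
        (void) this;
        (void) key;
        (void) replace_key;
        return 0;
}

/* essence of the posted change: validate the dict before touching it and
 * treat "no xattr dict" as "nothing to rename" instead of crashing */
static int
selinux_rename_xattr_sketch (dict_t *dict)
{
        if (!dict) {                     /* <-- the added validation */
                fprintf (stderr, "no xattr dict; skipping rename\n");
                return 0;
        }
        return dict_rename_key_stub (dict, "security.selinux",
                                      "trusted.glusterfs.selinux");
}

int
main (void)
{
        /* with the guard, a NULL dict no longer takes the brick down */
        return selinux_rename_xattr_sketch (NULL);
}

With a guard like that in place, a setxattr/fsetxattr carrying no xattr dict is
simply passed through instead of bringing the brick process down.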

--- Additional comment from Worker Ant on 2018-01-17 23:28:05 EST ---

REVISION POSTED: https://review.gluster.org/19220 (selinux-xlator : validate
dict before calling dict_rename_key()) posted (#2) for review on master by
jiffin tony Thottan

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.