[Bugs] [Bug 1535772] New: Random GlusterFSD process dies during rebalance
bugzilla at redhat.com
Thu Jan 18 04:23:55 UTC 2018
https://bugzilla.redhat.com/show_bug.cgi?id=1535772
Bug ID: 1535772
Summary: Random GlusterFSD process dies during rebalance
Product: GlusterFS
Version: mainline
Component: core
Severity: medium
Assignee: jthottan at redhat.com
Reporter: jthottan at redhat.com
CC: amukherj at redhat.com, bugs at gluster.org,
jmackey at getcruise.com, jthottan at redhat.com
Depends On: 1533269
+++ This bug was initially created as a clone of Bug #1533269 +++
Description of problem:
During a rebalance of a 252 brick volume, as the rebalance is scanning through
the initial directories, within 5-10 minutes, a seemingly random peer brick
process dies which stops the rebalance processes. The brick logs contain
healthy connection and disconnection up until the failure, where the brick
process throws a stack trace:
pending frames:
frame : type(0) op(36)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-01-10 21:13:21
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.4
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xaa)[0x7f7000635a5a]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x2e7)[0x7f700063f737]
/lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7f6fffa284b0]
/lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4)[0x7f6fffdc6d44]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_rename_key+0x66)[0x7f7000630866]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/selinux.so(+0x1f15)[0x7f6ff87a9f15]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/marker.so(+0x11d77)[0x7f6ff8595d77]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/quota.so(+0xe02f)[0x7f6ff3de602f]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/debug/io-stats.so(+0x72d6)[0x7f6ff3bb12d6]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0x2e3be)[0x7f6ff37763be]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbd19)[0x7f6ff3753d19]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdb5)[0x7f6ff3753db5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc75c)[0x7f6ff375475c]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdfe)[0x7f6ff3753dfe]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc5fe)[0x7f6ff37545fe]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc712)[0x7f6ff3754712]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdde)[0x7f6ff3753dde]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc804)[0x7f6ff3754804]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0x276ce)[0x7f6ff376f6ce]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpcsvc_request_handler+0x96)[0x7f70003feca6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f6fffdc46ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6fffafa3dd]
---------
Anywhere from 1 to 5 brick processes on various hosts will all die at the same
time.
Version-Release number of selected component (if applicable):
3.12.4
How reproducible:
This happens consistently within the first 10 minutes of a rebalance.
Steps to Reproduce:
- We had an existing gluster volume with about 2PB of data in it. Since our
existing gluster configs (3.7.20) were pretty old, we decided to bring down the
cluster and rebuild it fresh with the existing data. All gluster 3.7.20
libraries were purged, .glusterfs directory deleted from each brick and
glusterd 3.12.4 was installed. All 252 bricks were re-added to the cluster and
a fix-layout performed successfully. However, when a full rebalance is
initiated, eventually peer brick processes will crash.
Actual results:
Expected results:
Additional info:
--- Additional comment from Atin Mukherjee on 2018-01-11 03:46:53 EST ---
Jiffin, seems to be crashing from selinux.c. Can you please check?
--- Additional comment from Jiffin on 2018-01-11 04:03:02 EST ---
Sure I will take a look
--- Additional comment from Worker Ant on 2018-01-17 23:02:01 EST ---
REVIEW: https://review.gluster.org/19220 (selinux-xlator : validate dict before
calling dict_rename_key()) posted (#1) for review on master by jiffin tony
Thottan
--- Additional comment from Jiffin on 2018-01-17 23:03:17 EST ---
From the core it looks like a NULL dict was passed to the fops handled by the
selinux xlator, which caused this error. A patch has been posted upstream for
review: https://review.gluster.org/19220
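The patch above validates the dict before calling dict_rename_key(). A minimal,
self-contained sketch of that guard pattern (the dict_t type, the stand-in
dict_rename_key(), and the xattr key names here are simplified illustrations,
not the real GlusterFS dict API):

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for the glusterfs dict_t; the real one is a hash of key/value
 * pairs, but a single key suffices to show the crash and the guard. */
typedef struct { char key[64]; } dict_t;

/* Stand-in for dict_rename_key(): it dereferences dict unconditionally,
 * so a NULL dict segfaults, as in the reported backtrace. */
static int dict_rename_key(dict_t *dict, const char *old_key,
                           const char *new_key)
{
    if (strcmp(dict->key, old_key) != 0)
        return -1;                            /* old key not present */
    snprintf(dict->key, sizeof(dict->key), "%s", new_key);
    return 0;
}

/* The fsetxattr path can legally hand the selinux xlator a NULL dict
 * (e.g. during rebalance); checking before the call turns the crash
 * into a clean pass-through. Key names are illustrative. */
static int selinux_rename_xattr(dict_t *dict)
{
    if (dict == NULL)                         /* the missing validation */
        return 0;                             /* nothing to rename */
    return dict_rename_key(dict, "security.selinux",
                           "trusted.glusterfs.selinux");
}
```

With the guard in place, a NULL dict is a no-op instead of a SIGSEGV, and a
populated dict still gets its key renamed as before.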
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1533269
[Bug 1533269] Random GlusterFSD process dies during rebalance