[Bugs] [Bug 1727329] New: glustershd dumped core with seg fault at afr_has_quorum

bugzilla at redhat.com bugzilla at redhat.com
Fri Jul 5 15:03:38 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1727329

            Bug ID: 1727329
           Summary: glustershd dumped core with seg fault at
                    afr_has_quorum
           Product: GlusterFS
           Version: mainline
            Status: NEW
         Component: replicate
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: rkavunga at redhat.com
                CC: amukherj at redhat.com, anepatel at redhat.com,
                    bugs at gluster.org, nchilaka at redhat.com,
                    rhs-bugs at redhat.com, rkavunga at redhat.com,
                    sankarshan at redhat.com, storage-qa-internal at redhat.com
        Depends On: 1725024
            Blocks: 1529501
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1725024 +++

Description of problem:
====================
Hit this problem along with BZ#1725022, but on a different node and probably in
a different iteration.

I was running volume creates and deletes on my brick-mux setup, creating
volumes of different types. After about 10 hours, shd crashed and dumped core
with the backtrace below.



warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from
/usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install
/usr/lib/debug/.build-id/c8/fbb951579a5ccf45f786661b585545f43e4870
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id
shd/rep_vnpuvcukvzkhb -p /var/run'.
Program terminated with signal 11, Segmentation fault.
#0  afr_has_quorum (subvols=0x7f94eb2093d0 "", this=this@entry=0x7f94eb0bc070,
frame=frame@entry=0x0) at afr-transaction.c:871
871         up_children_count = AFR_COUNT(subvols, priv->child_count);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64
libcom_err-1.42.9-15.el7.x86_64 libgcc-4.8.5-39.el7.x86_64
libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-61.el7.x86_64
openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64
zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  afr_has_quorum (subvols=0x7f94eb2093d0 "", this=this@entry=0x7f94eb0bc070,
frame=frame@entry=0x0) at afr-transaction.c:871
#1  0x00007fa3bf5b82cd in afr_notify (this=0x7f94eb0bc070, event=6,
data=0x7f94eb0a3f30, data2=<optimized out>) at afr-common.c:5586
#2  0x00007fa3bf5b86c9 in notify (this=<optimized out>, event=<optimized out>,
data=<optimized out>) at afr.c:42
#3  0x00007fa3cdf682a2 in xlator_notify (xl=0x7f94eb0bc070,
event=event@entry=6, data=data@entry=0x7f94eb0a3f30) at xlator.c:692
#4  0x00007fa3ce01fd15 in default_notify (this=this@entry=0x7f94eb0a3f30,
event=event@entry=6, data=data@entry=0x0) at defaults.c:3388
#5  0x00007fa3bf7f2469 in client_notify_dispatch
(this=this@entry=0x7f94eb0a3f30, event=event@entry=6, data=data@entry=0x0) at
client.c:97
#6  0x00007fa3bf7f24ca in client_notify_dispatch_uniq
(this=this@entry=0x7f94eb0a3f30, event=event@entry=6, data=data@entry=0x0) at
client.c:71
#7  0x00007fa3bf7f348d in client_rpc_notify (rpc=0x7f94eb5d08b0,
mydata=0x7f94eb0a3f30, event=<optimized out>, data=<optimized out>) at
client.c:2365
#8  0x00007fa3cdd14203 in rpc_clnt_handle_disconnect (conn=0x7f94eb5d08e0,
clnt=0x7f94eb5d08b0) at rpc-clnt.c:826
#9  rpc_clnt_notify (trans=0x7f94eb5d0b70, mydata=0x7f94eb5d08e0,
event=RPC_TRANSPORT_DISCONNECT, data=<optimized out>) at rpc-clnt.c:887
#10 0x00007fa3cdd10a53 in rpc_transport_notify (this=this@entry=0x7f94eb5d0b70,
event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7f94eb5d0b70) at
rpc-transport.c:547
#11 0x00007fa3c232a2df in socket_event_poll_err
(this=this@entry=0x7f94eb5d0b70, gen=gen@entry=1, idx=idx@entry=4638) at
socket.c:1385
#12 0x00007fa3c232c6ea in socket_event_handler (fd=<optimized out>,
idx=<optimized out>, gen=<optimized out>, data=0x7f94eb5d0b70,
poll_in=<optimized out>, poll_out=<optimized out>, poll_err=16,
event_thread_died=0 '\000')
    at socket.c:3008
#13 0x00007fa3cdfd1416 in event_dispatch_epoll_handler (event=0x7fa3c0277e70,
event_pool=0x5587f336f5b0) at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x5587f33c1fd0) at event-epoll.c:761
#15 0x00007fa3ccda9ea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007fa3cc66f8cd in clone () from /lib64/libc.so.6
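
For context, the faulting statement at afr-transaction.c:871 only counts how
many child subvolumes are currently up by walking a per-child flag array. A
simplified sketch of that counting (illustrative only, not the actual
AFR_COUNT macro) is:

/* Simplified illustration of the counting done at the crash site.
 * `subvols` stands for the per-child up/down flag array passed to
 * afr_has_quorum(); if that array (or the priv it came from) has already
 * been freed, this walk reads invalid memory and can SIGSEGV. */
static int
count_up_children(const unsigned char *subvols, int child_count)
{
    int up = 0;

    for (int i = 0; i < child_count; i++)
        if (subvols[i])
            up++;

    return up;
}

So the subvols pointer in frame #0 does not have to be NULL for the crash;
pointing into memory that has since been freed is enough, which matches the
RCA further below.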



Version-Release number of selected component (if applicable):
===================
6.0.6 


How reproducible:
===============
Hit thrice on this cluster.

Steps to Reproduce:
1. Created a 3-node brick-mux setup.
2. Triggered a script which creates 1000 volumes of randomly chosen types
(single-brick, rep3, dist-rep3, arb, dist-arb, ecv, dist-ecv).
3. Deleted all the volumes.
4. Repeated steps 2 and 3, using the loop below:

for j in {1..100}; do
  echo "########## loop $j #### " |& tee -a volc.log; date |& tee -a volc.log
  for i in {1..1000}; do python randvol-create.py |& tee -a volc.log; done
  for v in $(gluster v list); do
    gluster v stop $v --mode=script |& tee -a volc.log; date |& tee -a volc.log
    gluster v del $v --mode=script |& tee -a volc.log
  done
done

Actual results:
================
glustershd crash as above

--- Additional comment from Anees Patel on 2019-06-28 10:04:59 UTC ---

The same issue was hit twice on two different clusters on the latest 6.0.7
build. The same backtrace was seen; brick-mux was not enabled.

Steps performed:
1. Created 100 volumes
2. Stopped 90 volumes
3. Deleted 90 volumes

Unable to trigger heal as shd has crashed.

sos-reports and core can be found at
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/anees/shd-mux3/

--- Additional comment from Mohammed Rafi KC on 2019-06-28 14:10:15 UTC ---

RCA:

This is a race condition between graph cleanup and the processing of a
disconnect. It happens because of a bug in the CHILD_DOWN event handling in
default_notify. When the CHILD_DOWN event is processed on the first xlator
(debug/io-stats in this case), we mark the graph as no longer usable, so graph
cleanup will then proceed to free the memory.

The race happens when all of the subvolumes are already down before a graph
cleanup is initiated, which means graph->used is already set to 0. The cleanup
thread then assumes that PARENT_DOWN has already completed, because it sees
that the graph is no longer used, and goes on to free the xlator resources.
But a CHILD_DOWN notification initiated by the PARENT_DOWN may still be in
progress.

I will post the fix; while fixing this, we also need to think through other
scenarios.
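
To make the ordering problem concrete, below is a minimal, self-contained
sketch of the race pattern described above (names such as fake_graph and
fake_priv are illustrative, not GlusterFS code): one thread plays the event
thread that is still delivering a CHILD_DOWN-style notification and walking
the per-child flags, while the other plays the cleanup thread that sees
used == 0, assumes PARENT_DOWN has finished, and frees the private data.
Depending on scheduling, the notifying thread reads freed (or NULL) memory,
which is the same class of failure seen in the afr_has_quorum crash.

/* Minimal sketch of the race described above; illustrative only. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_priv {
    int child_count;
    unsigned char *child_up;      /* per-child up/down flags */
};

struct fake_graph {
    int used;                     /* analogous to graph->used */
    struct fake_priv *priv;
};

static struct fake_graph graph;

/* Event thread: still delivering a notification, walking priv->child_up
 * the way afr_has_quorum()/__get_heard_from_all_status() walk priv. */
static void *
event_thread(void *arg)
{
    struct fake_priv *priv = graph.priv;
    int up = 0;

    for (int i = 0; i < priv->child_count; i++)   /* may read freed memory */
        if (priv->child_up[i])
            up++;
    printf("up children: %d\n", up);
    return arg;
}

/* Cleanup thread: used == 0 is wrongly taken to mean "PARENT_DOWN is done",
 * so private data is freed while a notification may still be in flight. */
static void *
cleanup_thread(void *arg)
{
    if (graph.used == 0) {
        free(graph.priv->child_up);
        free(graph.priv);
        graph.priv = NULL;
    }
    return arg;
}

int
main(void)
{
    pthread_t ev, cl;

    graph.priv = calloc(1, sizeof(*graph.priv));
    graph.priv->child_count = 3;
    graph.priv->child_up = calloc(3, 1);
    graph.used = 0;               /* all children already down */

    pthread_create(&cl, NULL, cleanup_thread, NULL);
    pthread_create(&ev, NULL, event_thread, NULL);
    pthread_join(cl, NULL);
    pthread_join(ev, NULL);
    return 0;
}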

--- Additional comment from Mohammed Rafi KC on 2019-06-28 14:10:56 UTC ---



--- Additional comment from nchilaka on 2019-07-03 11:06:19 UTC ---

Another crash was hit with a similar backtrace, when trying to add bricks
randomly to any one of 3 different volumes (one each of disp, rep3, arb); this
was tried 20 times.

warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from
/usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install
/usr/lib/debug/.build-id/c8/fbb951579a5ccf45f786661b585545f43e4870
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id shd/arb -p
/var/run/gluster/shd/a'.
Program terminated with signal 11, Segmentation fault.
#0  __get_heard_from_all_status (this=this@entry=0x7fb534263680) at
afr-common.c:5024
5024        for (i = 0; i < priv->child_count; i++) {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64
libcom_err-1.42.9-15.el7.x86_64 libgcc-4.8.5-39.el7.x86_64
libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-61.el7.x86_64
openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64
zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  __get_heard_from_all_status (this=this@entry=0x7fb534263680) at
afr-common.c:5024
#1  0x00007fb5447b6e27 in afr_notify (this=0x7fb534263680, event=6,
data=0x7fb53425c080, data2=<optimized out>)
    at afr-common.c:5519
#2  0x00007fb5447b76c9 in notify (this=<optimized out>, event=<optimized out>,
data=<optimized out>) at afr.c:42
#3  0x00007fb553167272 in xlator_notify (xl=0x7fb534263680,
event=event@entry=6, data=data@entry=0x7fb53425c080)
    at xlator.c:699
#4  0x00007fb55321e1d5 in default_notify (this=this@entry=0x7fb53425c080,
event=event@entry=6, data=data@entry=0x0)
    at defaults.c:3388
#5  0x00007fb5449f1469 in client_notify_dispatch
(this=this@entry=0x7fb53425c080, event=event@entry=6,
    data=data@entry=0x0) at client.c:97
#6  0x00007fb5449f14ca in client_notify_dispatch_uniq
(this=this@entry=0x7fb53425c080, event=event@entry=6,
    data=data@entry=0x0) at client.c:71
#7  0x00007fb5449f2465 in client_rpc_notify (rpc=0x7fb5343ecb30,
mydata=0x7fb53425c080, event=<optimized out>, 
    data=<optimized out>) at client.c:2373
#8  0x00007fb552f13203 in rpc_clnt_handle_disconnect (conn=0x7fb5343ecb60,
clnt=0x7fb5343ecb30) at rpc-clnt.c:826
#9  rpc_clnt_notify (trans=0x7fb5343ecdf0, mydata=0x7fb5343ecb60,
event=RPC_TRANSPORT_DISCONNECT, 
    data=<optimized out>) at rpc-clnt.c:887
#10 0x00007fb552f0fa53 in rpc_transport_notify (this=this@entry=0x7fb5343ecdf0,
    event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7fb5343ecdf0)
at rpc-transport.c:547
#11 0x00007fb5475292df in socket_event_poll_err
(this=this@entry=0x7fb5343ecdf0, gen=gen@entry=1, idx=idx@entry=18)
    at socket.c:1385
#12 0x00007fb54752b6ea in socket_event_handler (fd=<optimized out>,
idx=<optimized out>, gen=<optimized out>, 
    data=0x7fb5343ecdf0, poll_in=<optimized out>, poll_out=<optimized out>,
poll_err=16, event_thread_died=0 '\000')
    at socket.c:3008
#13 0x00007fb5531cf8d6 in event_dispatch_epoll_handler (event=0x7fb545476130,
event_pool=0x56031562e560)
---Type <return> to continue, or q <return> to quit---
    at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x56031567eaa0) at event-epoll.c:761
#15 0x00007fb551fa8ea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007fb55186e8cd in clone () from /lib64/libc.so.6
(gdb)

--- Additional comment from nchilaka on 2019-07-03 11:08:03 UTC ---

pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2019-07-03 11:00:11
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.0
/lib64/libglusterfs.so.0(+0x27210)[0x7fb55316a210]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fb553174c34]
/lib64/libc.so.6(+0x363f0)[0x7fb5517a63f0]
/usr/lib64/glusterfs/6.0/xlator/cluster/replicate.so(+0x5e447)[0x7fb5447a3447]
/usr/lib64/glusterfs/6.0/xlator/cluster/replicate.so(+0x71e27)[0x7fb5447b6e27]
/usr/lib64/glusterfs/6.0/xlator/cluster/replicate.so(+0x726c9)[0x7fb5447b76c9]
/lib64/libglusterfs.so.0(xlator_notify+0x32)[0x7fb553167272]
/lib64/libglusterfs.so.0(default_notify+0x195)[0x7fb55321e1d5]
/usr/lib64/glusterfs/6.0/xlator/protocol/client.so(+0x14469)[0x7fb5449f1469]
/usr/lib64/glusterfs/6.0/xlator/protocol/client.so(+0x144ca)[0x7fb5449f14ca]
/usr/lib64/glusterfs/6.0/xlator/protocol/client.so(+0x15465)[0x7fb5449f2465]
/lib64/libgfrpc.so.0(+0xf203)[0x7fb552f13203]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fb552f0fa53]
/usr/lib64/glusterfs/6.0/rpc-transport/socket.so(+0x82df)[0x7fb5475292df]
/usr/lib64/glusterfs/6.0/rpc-transport/socket.so(+0xa6ea)[0x7fb54752b6ea]
/lib64/libglusterfs.so.0(+0x8c8d6)[0x7fb5531cf8d6]
/lib64/libpthread.so.0(+0x7ea5)[0x7fb551fa8ea5]
/lib64/libc.so.6(clone+0x6d)[0x7fb55186e8cd]


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1529501
[Bug 1529501] [shd] : shd occupies ~7.8G in memory while toggling
cluster.self-heal-daemon in loop , possibly leaky.
https://bugzilla.redhat.com/show_bug.cgi?id=1725024
[Bug 1725024] glustershd dumped core with seg fault at afr_has_quorum
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

