[Bugs] [Bug 1808875] New: [brickmux]: glustershd crashed when rebooting 1/3 nodes at regular intervals

Sun Mar 1 19:00:05 UTC 2020

https://bugzilla.redhat.com/show_bug.cgi?id=1808875

            Bug ID: 1808875
           Summary: [brickmux]: glustershd crashed when rebooting 1/3
                    nodes at regular intervals
           Product: GlusterFS
           Version: mainline
            Status: NEW
         Component: replicate
          Keywords: ZStream
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: jahernan at redhat.com
                CC: bugs at gluster.org, moagrawa at redhat.com,
                    nchilaka at redhat.com, pkarampu at redhat.com,
                    pprakash at redhat.com, ravishankar at redhat.com,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    rkavunga at redhat.com, sheggodu at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1794663
  Target Milestone: ---
    Classification: Community

+++ This bug was initially created as a clone of Bug #1794663 +++

Description of problem:
=====================
glustershd crashed on 2 nodes when testing the fix for BZ#1793035
more details are available @
https://bugzilla.redhat.com/show_bug.cgi?id=1793035#c21

Also, another issue:
I do see transport end point errors for some of the volumes intermittently when
we do the reboot of n2.
this is consistently seen for some set of volumes(all arbiter) and the issue
goes away as soon as n2 is up

eg:
[root at rhsqa7 glusterfs]# pwd
/var/log/glusterfs
[root at rhsqa7 glusterfs]# less mnt-arb_4yqv4qry9ogxx.log 
[root at rhsqa7 glusterfs]# cd /mnt/arb_4yqv4qry9ogxx 
-bash: cd: /mnt/arb_4yqv4qry9ogxx: Transport endpoint is not connected
[root at rhsqa7 glusterfs]# 
[root at rhsqa7 glusterfs]# vim mnt-arb_4yqv4qry9ogxx.log 
[root at rhsqa7 glusterfs]# cd /mnt/arb_4yqv4qry9ogxx 
[root at rhsqa7 arb_4yqv4qry9ogxx]# ls
kernel
[root at rhsqa7 arb_4yqv4qry9ogxx]# ls
ls: cannot open directory .: Transport endpoint is not connected
[root at rhsqa7 arb_4yqv4qry9ogxx]# 
[root at rhsqa7 arb_4yqv4qry9ogxx]# 
[root at rhsqa7 arb_4yqv4qry9ogxx]# ls
ls: cannot open directory .: Transport endpoint is not connected
[root at rhsqa7 arb_4yqv4qry9ogxx]# 
[root at rhsqa7 arb_4yqv4qry9ogxx]# 
[root at rhsqa7 arb_4yqv4qry9ogxx]# ls
ls: cannot open directory .: Transport endpoint is not connected
[root at rhsqa7 arb_4yqv4qry9ogxx]# 
[root at rhsqa7 arb_4yqv4qry9ogxx]# pwd
/mnt/arb_4yqv4qry9ogxx
[root at rhsqa7 arb_4yqv4qry9ogxx]# 
[root at rhsqa7 arb_4yqv4qry9ogxx]# 
[root at rhsqa7 arb_4yqv4qry9ogxx]# ls
kernel
[root at rhsqa7 arb_4yqv4qry9ogxx]# 

[root at rhsqa7 arb_4yqv4qry9ogxx]# cd .meta
[root at rhsqa7 .meta]# ls
cmdline  frames  graphs  logging  mallinfo  master  measure_latency 
process_uuid  version
[root at rhsqa7 .meta]# cd graphs/active/
[root at rhsqa7 active]# ls
arb_4yqv4qry9ogxx           arb_4yqv4qry9ogxx-dht         
arb_4yqv4qry9ogxx-replicate-0   volfile
arb_4yqv4qry9ogxx-client-0  arb_4yqv4qry9ogxx-md-cache    
arb_4yqv4qry9ogxx-write-behind
arb_4yqv4qry9ogxx-client-1  arb_4yqv4qry9ogxx-open-behind  meta-autoload
arb_4yqv4qry9ogxx-client-2  arb_4yqv4qry9ogxx-quick-read   top
[root at rhsqa7 active]# cat arb_4yqv4qry9ogxx-client-*/private |grep connect
connected = 1
connected = 0
connected = 1
[root at rhsqa7 active]# 

However we must not see that because always same node is brought down.

n1:
====
[root at rhsqa10 ~]# #gdb /core.10158 
[root at rhsqa10 ~]# 

warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from
/usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install
/usr/lib/debug/.build-id/b5/a5458535a1397fa6baaf5e8c13a6395426a1b2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id
gluster/glustershd -p /var/run/gl'.
Program terminated with signal 11, Segmentation fault.
#0  client4_0_lookup_cbk (req=0x7f5a38185988, iov=0x7f5a381859c0,
count=<optimized out>, myframe=0x7f5a38002ca8)
    at client-rpc-fops_v2.c:2576
2576        inode = local->loc.inode;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64
libcom_err-1.42.9-16.el7.x86_64 libgcc-4.8.5-39.el7.x86_64
libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-61.el7_7.1.x86_64
openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64
zlib-1.2.7-18.el7.x86_64
(gdb) 
(gdb) bt
#0  client4_0_lookup_cbk (req=0x7f5a38185988, iov=0x7f5a381859c0,
count=<optimized out>, myframe=0x7f5a38002ca8)
    at client-rpc-fops_v2.c:2576
#1  0x00007f660b06d0f1 in rpc_clnt_handle_reply
(clnt=clnt at entry=0x7f65f3511830, pollin=pollin at entry=0x7f65f8218360)
    at rpc-clnt.c:764
#2  0x00007f660b06d457 in rpc_clnt_notify (trans=0x7f65f3511b60,
mydata=0x7f65f3511860, event=<optimized out>, 
    data=0x7f65f8218360) at rpc-clnt.c:931
#3  0x00007f660b069af3 in rpc_transport_notify (this=this at entry=0x7f65f3511b60, 
    event=event at entry=RPC_TRANSPORT_MSG_RECEIVED,
data=data at entry=0x7f65f8218360) at rpc-transport.c:547
#4  0x00007f65ff685af5 in socket_event_poll_in (notify_handled=true,
this=0x7f65f3511b60) at socket.c:2582
#5  socket_event_handler (fd=<optimized out>, idx=<optimized out>,
gen=<optimized out>, data=0x7f65f3511b60, 
    poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0,
event_thread_died=0 '\000') at socket.c:2991
#6  0x00007f660b328806 in event_dispatch_epoll_handler (event=0x7f65fddd1130,
event_pool=0x55998e37e570)
    at event-epoll.c:656
#7  event_dispatch_epoll_worker (data=0x55998e3ce150) at event-epoll.c:769
#8  0x00007f660a102ea5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f66099c88cd in clone () from /lib64/libc.so.6
(gdb) t a a bt

Thread 523 (Thread 0x7f65a67fc700 (LWP 10203)):
#0  0x00007f660a106de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x00007f65fc6a20ce in __afr_shd_healer_wait
(healer=healer at entry=0x7f65f154e3c0) at afr-self-heald.c:100
#2  0x00007f65fc6a214d in afr_shd_healer_wait
(healer=healer at entry=0x7f65f154e3c0) at afr-self-heald.c:121
#3  0x00007f65fc6a3c10 in afr_shd_index_healer (data=0x7f65f154e3c0) at
afr-self-heald.c:818
#4  0x00007f660a102ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f66099c88cd in clone () from /lib64/libc.so.6

Thread 522 (Thread 0x7f62e2ffd700 (LWP 10581)):
#0  0x00007f660a106de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x00007f65fc6a20ce in __afr_shd_healer_wait
(healer=healer at entry=0x7f65f253d690) at afr-self-heald.c:100
#2  0x00007f65fc6a214d in afr_shd_healer_wait
(healer=healer at entry=0x7f65f253d690) at afr-self-heald.c:121
#3  0x00007f65fc6a3c10 in afr_shd_index_healer (data=0x7f65f253d690) at
afr-self-heald.c:818
#4  0x00007f660a102ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f66099c88cd in clone () from /lib64/libc.so.6

Thread 521 (Thread 0x7f633bfff700 (LWP 10530)):
#0  0x00007f660a106de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x00007f65fc6a20ce in __afr_shd_healer_wait
(healer=healer at entry=0x7f65f230e280) at afr-self-heald.c:100
#2  0x00007f65fc6a214d in afr_shd_healer_wait
(healer=healer at entry=0x7f65f230e280) at afr-self-heald.c:121
#3  0x00007f65fc6a3c10 in afr_shd_index_healer (data=0x7f65f230e280) at
afr-self-heald.c:818
#4  0x00007f660a102ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f66099c88cd in clone () from /lib64/libc.so.6

---Type <return> to continue, or q <return> to quit---q
Quit

n3:
===
warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from
/usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Cannot access memory at address 0x7fd91c78a128
Cannot access memory at address 0x7fd91c78a120
Failed to read a valid object file image from memory.
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id
gluster/glustershd -p /var/run/gl'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fd91c02226c in ?? ()
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd90e586db8: 
(gdb) 
#0  0x00007fd91c02226c in ?? ()
Cannot access memory at address 0x7fd90e586db8
(gdb) 
#0  0x00007fd91c02226c in ?? ()
Cannot access memory at address 0x7fd90e586db8
(gdb) t a a bt

Thread 523 (LWP 10404):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd80effc020: 

Thread 522 (LWP 10742):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd472ffbce0: 

Thread 521 (LWP 10668):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd538ff8020: 

Thread 520 (LWP 10751):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd461ffa020: 

Thread 519 (LWP 10758):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd44dffa020: 

Thread 518 (LWP 10653):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd55d7f9020: 

Thread 517 (LWP 10489):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd70effc020: 

Thread 516 (LWP 10703):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd4dd7f9020: 

Thread 515 (LWP 10734):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd4877fcce0: 

Thread 514 (LWP 10399):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd815ffa020: 

Thread 513 (LWP 10396):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd816ffc020: 

Thread 512 (LWP 10772):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address
0x7fd429ffa020: 

Version-Release number of selected component (if applicable):
=============
6.0.29

How reproducible:
==================
hit on 2 different crashes on 2 different nodes at different intervals

Steps to Reproduce:
================
1) 3 node cluster with bmux enabled
2) created 500 vols, either of arbiter or x3 type
3) mounted all 500 vols on 3 different clients
4) started IOs, ie cp linux image to the mount point, and do untar. 
This is one mount at a time, but 3 screen sessions on each node and all
different set of volumes being selected.
Hence 9 linux untars parallely , on different volumes at any point of time
5) Did node reboot of n2(rhsqa11) every 15min(through ssh from n1 ie rhsqa10)
6) left it overnight

Actual results:
================
1) found the crash on n1 and n2
2) i do see transport end point errors for some of the volumes intermittently
when we do the reboot of n2.
this is consistently seen for some set of volumes(all arbiter) and the issue
goes away as soon as n2 is up

However we must not see that because always same node is brought down.

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1794663
[Bug 1794663] [brickmux]: glustershd crashed when rebooting 1/3 nodes at
regular intervals
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.