[Bugs] [Bug 1635593] glusterd crashed in cleanup_and_exit when glusterd comes up with upgrade mode.

bugzilla at redhat.com bugzilla at redhat.com
Wed Oct 3 11:15:21 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1635593

Atin Mukherjee <amukherj at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[GSS] glusterd crashed in   |glusterd crashed in
                   |cleanup_and_exit when       |cleanup_and_exit when
                   |upgrading to RHGS 3.4       |glusterd comes up with
                   |                            |upgrade mode.



--- Comment #1 from Atin Mukherjee <amukherj at redhat.com> ---
When glusterd is brought up with upgrade mode on, it is observed that
glusterd crashes with the backtrace below:
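
For context, upgrade mode is the one-shot mode the package scriptlets use to
regenerate volfiles on an upgrade; it is typically entered with an invocation
along these lines (the exact options may vary by version):

    glusterd --xlator-option *.upgrade=on -N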

Backtrace:
=========
(gdb) bt
#0  _rcu_read_lock_bp () at urcu/static/urcu-bp.h:199
#1  rcu_read_lock_bp () at urcu-bp.c:271
#2  0x00007fb1d33cd256 in __glusterd_peer_rpc_notify (rpc=0x7fb1df49c8d0, 
    mydata=<value optimized out>, event=RPC_CLNT_DISCONNECT, 
    data=<value optimized out>) at glusterd-handler.c:4996
#3  0x00007fb1d33b0c50 in glusterd_big_locked_notify (rpc=0x7fb1df49c8d0, 
    mydata=0x7fb1df49c250, event=RPC_CLNT_DISCONNECT, data=0x0, 
    notify_fn=0x7fb1d33cd1f0 <__glusterd_peer_rpc_notify>)
    at glusterd-handler.c:71
#4  0x00007fb1de793953 in rpc_clnt_notify (trans=<value optimized out>, 
    mydata=0x7fb1df49c900, event=<value optimized out>, 
    data=<value optimized out>) at rpc-clnt.c:861
#5  0x00007fb1de78ead8 in rpc_transport_notify (this=<value optimized out>, 
    event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:543
#6  0x00007fb1d1a53df1 in socket_event_poll_err (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1205
#7  socket_event_handler (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:2410
#8  0x00007fb1dea27970 in event_dispatch_epoll_handler (data=0x7fb1df4edda0)
    at event-epoll.c:575
#9  event_dispatch_epoll_worker (data=0x7fb1df4edda0) at event-epoll.c:678
#10 0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fb1dd41896d in clone () from /lib64/libc.so.6
(gdb)

(gdb) t a a bt

Thread 7 (Thread 0x7fb1cd6a7700 (LWP 10080)):
#0  0x00007fb1ddab2a0e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007fb1dea0acab in syncenv_task (proc=0x7fb1df34bb00) at syncop.c:595
#2  0x00007fb1dea0fba0 in syncenv_processor (thdata=0x7fb1df34bb00)
    at syncop.c:687
#3  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fb1d5f03700 (LWP 2914)):
#0  0x00007fb1ddab5fbd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fb1de9e55ca in gf_timer_proc (ctx=0x7fb1df31d010) at timer.c:205
#2  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fb1d4100700 (LWP 2917)):
#0  0x00007fb1ddab2a0e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007fb1dea0acab in syncenv_task (proc=0x7fb1df34afc0) at syncop.c:595
#2  0x00007fb1dea0fba0 in syncenv_processor (thdata=0x7fb1df34afc0)
    at syncop.c:687
#3  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fb1d02c3700 (LWP 3091)):
#0  0x00007fb1ddab263c in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007fb1d3465973 in hooks_worker (args=<value optimized out>)
    at glusterd-hooks.c:534
#2  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fb1dee72740 (LWP 2913)):
#0  0x00007fb1ddaaf2ad in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fb1dea2741d in event_dispatch_epoll (event_pool=0x7fb1df33bc90)
    at event-epoll.c:762
#2  0x00007fb1dee8eef1 in main (argc=2, argv=0x7ffdfed58a08)
    at glusterfsd.c:2333

Thread 2 (Thread 0x7fb1d5502700 (LWP 2915)):
#0  0x00007fb1d2c1cf18 in _fini () from /usr/lib64/liburcu-cds.so.1.0.0
#1  0x00007fb1dec72c7c in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#2  0x00007fb1dd365b22 in exit () from /lib64/libc.so.6
#3  0x00007fb1dee8cc03 in cleanup_and_exit (signum=<value optimized out>)
    at glusterfsd.c:1276
#4  0x00007fb1dee8d075 in glusterfs_sigwaiter (arg=<value optimized out>)
    at glusterfsd.c:1997
#5  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fb1cf8c2700 (LWP 3092)):
#0  _rcu_read_lock_bp () at urcu/static/urcu-bp.h:199
#1  rcu_read_lock_bp () at urcu-bp.c:271
#2  0x00007fb1d33cd256 in __glusterd_peer_rpc_notify (rpc=0x7fb1df49c8d0, 
    mydata=<value optimized out>, event=RPC_CLNT_DISCONNECT, 
    data=<value optimized out>) at glusterd-handler.c:4996
#3  0x00007fb1d33b0c50 in glusterd_big_locked_notify (rpc=0x7fb1df49c8d0, 
    mydata=0x7fb1df49c250, event=RPC_CLNT_DISCONNECT, data=0x0, 
    notify_fn=0x7fb1d33cd1f0 <__glusterd_peer_rpc_notify>)
    at glusterd-handler.c:71
#4  0x00007fb1de793953 in rpc_clnt_notify (trans=<value optimized out>, 
    mydata=0x7fb1df49c900, event=<value optimized out>, 
    data=<value optimized out>) at rpc-clnt.c:861
#5  0x00007fb1de78ead8 in rpc_transport_notify (this=<value optimized out>, 
    event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:543
#6  0x00007fb1d1a53df1 in socket_event_poll_err (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1205
#7  socket_event_handler (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:2410
#8  0x00007fb1dea27970 in event_dispatch_epoll_handler (data=0x7fb1df4edda0)
    at event-epoll.c:575
#9  event_dispatch_epoll_worker (data=0x7fb1df4edda0) at event-epoll.c:678
#10 0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fb1dd41896d in clone () from /lib64/libc.so.6
(gdb)

The root cause of the issue is the following:

This is a race in the cleanup path: the cleanup thread had already torn down
the URCU resources while another thread was still accessing them. The current
implementation does not synchronize the threads with respect to cleanup, which
is why the crash is seen. It is observed more often during an upgrade because
in upgrade mode glusterd just regenerates the volfiles and shuts down, but on
a multi-node setup glusterd also receives RPC connection events from other
peers while cleanup_and_exit() is being processed, which makes this race much
more likely.
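
As a minimal sketch of that window (hypothetical code, not the actual glusterd
sources): one thread calls exit(), which runs _dl_fini() and finalizes the
liburcu libraries, while an event worker thread is still entering RCU
read-side critical sections. Built against liburcu-bp, something like the
following can die inside rcu_read_lock(), just as thread 1 does in the
backtrace while thread 2 sits in _dl_fini():

    /* urcu_exit_race.c - illustrative only; names are made up.
     * Build: gcc urcu_exit_race.c -o urcu_exit_race -lurcu-bp -lpthread
     */
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <urcu-bp.h>

    static void *event_worker(void *arg)
    {
            (void)arg;
            /* Stands in for the epoll worker that handles the peer
             * RPC_CLNT_DISCONNECT and takes an RCU read lock in
             * __glusterd_peer_rpc_notify(). */
            for (;;) {
                    rcu_read_lock();   /* unsafe once liburcu is finalized */
                    /* ... walk the peer list under RCU protection ... */
                    rcu_read_unlock();
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t t;

            pthread_create(&t, NULL, event_worker, NULL);
            usleep(10 * 1000);  /* let the worker start spinning */

            /* Stands in for cleanup_and_exit(): exit() triggers
             * _dl_fini(), which runs the liburcu destructors while the
             * worker is still inside (or about to enter) a read-side
             * critical section. Nothing here waits for, or fences off,
             * the worker thread. */
            exit(EXIT_SUCCESS);
    }

The obvious shape of a fix (again, only a sketch of the direction, not the
patch that eventually landed) is to synchronize: stop or drain the event
threads, or guard the notify path, before exit() is allowed to run the
library destructors.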
