[Bugs] [Bug 1597229] New: glustershd crashes when index heal is launched before graph is initialized.

bugzilla at redhat.com
Mon Jul 2 10:11:42 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1597229

            Bug ID: 1597229
           Summary: glustershd crashes when index heal is launched before
                    graph is initialized.
           Product: GlusterFS
           Version: 4.1
         Component: core
          Keywords: Triaged
          Assignee: bugs at gluster.org
          Reporter: ravishankar at redhat.com
                CC: bugs at gluster.org
        Depends On: 1596513
            Blocks: 1460245, 1593865, 1595752



+++ This bug was initially created as a clone of Bug #1596513 +++

Description of problem:
glustershd crashes when an index heal is launched via the CLI before the
graph is initialized.

Version-Release number of selected component (if applicable)/ How reproducible:
I'm able to reproduce this easily on glusterfs-3.8.4 and only very
infrequently on glusterfs-3.12.2 (just once so far on 3.12.2).

Steps to Reproduce:
1. Create a replica 2 volume and start it.
2. In one terminal, run `while true; do gluster volume heal <volname>; sleep 0.5; done`.
3. In another terminal, keep running `service glusterd restart`.

Actual results:
Once in a while, shd crashes and does not come up again until it is restarted
manually:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id
gluster/glustershd -p /var/'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000040cdfa in glusterfs_handle_translator_op (req=0x7f0fa4003490)
at glusterfsd-mgmt.c:793
793             any = active->first;
[Current thread is 1 (Thread 0x7f0face2c700 (LWP 3716))]
Missing separate debuginfos, use: dnf debuginfo-install
glibc-2.25-12.fc26.x86_64 libgcc-7.2.1-2.fc26.x86_64
libuuid-2.30.2-1.fc26.x86_64 openssl-libs-1.1.0f-7.fc26.x86_64
sssd-client-1.16.0-1.fc26.x86_64 zlib-1.2.11-2.fc26.x86_64
(gdb) t a a bt

Thread 7 (Thread 0x7f0fabd86700 (LWP 3717)):
#0  0x00007f0fb65adce6 in fnmatch@@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007f0fb7f43f42 in gf_add_cmdline_options (graph=0x7f0fa4000c40,
cmd_args=0x15c2010) at graph.c:299
#2  0x00007f0fb7f449c0 in glusterfs_graph_prepare (graph=0x7f0fa4000c40,
ctx=0x15c2010, volume_name=0x0) at graph.c:588
#3  0x000000000040a74b in glusterfs_process_volfp (ctx=0x15c2010,
fp=0x7f0fa4006920) at glusterfsd.c:2368
#4  0x000000000040fc81 in mgmt_getspec_cbk (req=0x7f0fa4001d10,
iov=0x7f0fa4001d50, count=1, myframe=0x7f0fa4001560) at glusterfsd-mgmt.c:1989
#5  0x00007f0fb7cc26b5 in rpc_clnt_handle_reply (clnt=0x163fef0,
pollin=0x7f0fa40061b0) at rpc-clnt.c:778
#6  0x00007f0fb7cc2c53 in rpc_clnt_notify (trans=0x1640120, mydata=0x163ff20,
event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f0fa40061b0) at rpc-clnt.c:971
#7  0x00007f0fb7cbecb8 in rpc_transport_notify (this=0x1640120,
event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f0fa40061b0) at rpc-transport.c:538
#8  0x00007f0fac41919e in socket_event_poll_in (this=0x1640120,
notify_handled=_gf_true) at socket.c:2315
#9  0x00007f0fac4197c3 in socket_event_handler (fd=10, idx=1, gen=1,
data=0x1640120, poll_in=1, poll_out=0, poll_err=0) at socket.c:2467
#10 0x00007f0fb7f6d367 in event_dispatch_epoll_handler (event_pool=0x15f9240,
event=0x7f0fabd85e94) at event-epoll.c:583
#11 0x00007f0fb7f6d63e in event_dispatch_epoll_worker (data=0x1642f90) at
event-epoll.c:659
#12 0x00007f0fb6d3736d in start_thread () from /lib64/libpthread.so.0
#13 0x00007f0fb65e0e1f in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f0fad62d700 (LWP 3715)):
#0  0x00007f0fb6d3deb6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x00007f0fb7f48274 in syncenv_task (proc=0x16033c0) at syncop.c:603
#2  0x00007f0fb7f4850f in syncenv_processor (thdata=0x16033c0) at syncop.c:695
#3  0x00007f0fb6d3736d in start_thread () from /lib64/libpthread.so.0
#4  0x00007f0fb65e0e1f in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f0fade2e700 (LWP 3714)):
#0  0x00007f0fb65a4c0d in nanosleep () from /lib64/libc.so.6
#1  0x00007f0fb65a4b4a in sleep () from /lib64/libc.so.6
#2  0x00007f0fb7f32762 in pool_sweeper (arg=0x0) at mem-pool.c:481
#3  0x00007f0fb6d3736d in start_thread () from /lib64/libpthread.so.0
#4  0x00007f0fb65e0e1f in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f0fae62f700 (LWP 3713)):
#0  0x00007f0fb6d41f56 in sigwait () from /lib64/libpthread.so.0
#1  0x000000000040a001 in glusterfs_sigwaiter (arg=0x7fff4608dcd0) at
glusterfsd.c:2137
#2  0x00007f0fb6d3736d in start_thread () from /lib64/libpthread.so.0
#3  0x00007f0fb65e0e1f in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0faee30700 (LWP 3712)):
#0  0x00007f0fb6d4192d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f0fb7f0ee1c in gf_timer_proc (data=0x1600ed0) at timer.c:174
#2  0x00007f0fb6d3736d in start_thread () from /lib64/libpthread.so.0
#3  0x00007f0fb65e0e1f in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f0fb83fe780 (LWP 3711)):
#0  0x00007f0fb6d3883d in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f0fb7f6d89c in event_dispatch_epoll (event_pool=0x15f9240) at
event-epoll.c:746
#2  0x00007f0fb7f30f3a in event_dispatch (event_pool=0x15f9240) at event.c:124
#3  0x000000000040acce in main (argc=13, argv=0x7fff4608eec8) at
glusterfsd.c:2550

Thread 1 (Thread 0x7f0face2c700 (LWP 3716)):
#0  0x000000000040cdfa in glusterfs_handle_translator_op (req=0x7f0fa4003490)
at glusterfsd-mgmt.c:793
#1  0x00007f0fb7f47a44 in synctask_wrap () at syncop.c:375
#2  0x00007f0fb651c950 in ?? () from /lib64/libc.so.6
#3  0x0000000000000000 in ?? ()
(gdb) l
788                     goto out;
789             }
790
791             ctx = glusterfsd_ctx;
792             active = ctx->active;
793             any = active->first;
794             input = dict_new ();
795             ret = dict_unserialize (xlator_req.input.input_val,
796                                     xlator_req.input.input_len,
797                                     &input);
(gdb) p ctx->active
$1 = (glusterfs_graph_t *) 0x0
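
The print output above shows the cause: ctx->active is still NULL because
graph initialization has not completed (Thread 7 in the same core is still
inside glusterfs_graph_prepare()), so `any = active->first` at line 793
dereferences a NULL pointer. A minimal self-contained sketch of the pattern,
using simplified, hypothetical stand-ins rather than the real gluster types:

#include <stdio.h>

/* Simplified, hypothetical stand-ins for the real gluster types. */
typedef struct xlator { const char *name; } xlator_t;
typedef struct { xlator_t *first; } glusterfs_graph_t;
typedef struct { glusterfs_graph_t *active; } glusterfs_ctx_t;

int main(void)
{
        /* State of shd before glusterfs_graph_activate() has run: */
        glusterfs_ctx_t ctx = { .active = NULL };

        glusterfs_graph_t *active = ctx.active;  /* NULL, as gdb shows */
        xlator_t *any = active->first;           /* SIGSEGV: NULL deref */

        printf("%s\n", any->name);               /* never reached */
        return 0;
}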


Expected results:
shd must not crash

--- Additional comment from Worker Ant on 2018-06-29 03:28:22 EDT ---

REVIEW: https://review.gluster.org/20422 (glusterfsd: Do not process
GLUSTERD_BRICK_XLATOR_OP if graph is not ready) posted (#1) for review on
master by Ravishankar N

--- Additional comment from Worker Ant on 2018-07-02 06:10:49 EDT ---

COMMIT: https://review.gluster.org/20422 committed in master by "Atin
Mukherjee" <amukherj at redhat.com> with the commit message: glusterfsd: Do not
process GLUSTERD_BRICK_XLATOR_OP if graph is not ready

Problem:
If glustershd gets restarted by glusterd due to a node reboot, `volume start
force`, or anything else that changes the shd graph (add/remove brick), and an
index heal is launched via the CLI, there is a chance that shd receives this
IPC before the graph is fully active. When it then accesses
glusterfsd_ctx->active, it crashes.

Fix:
Since glusterd does not really wait for the daemons it spawns to be fully
initialized, and can send the request as soon as RPC initialization has
succeeded, we handle this at the shd end: if glusterfs_graph_activate() is
not yet done in shd when glusterd sends GD_OP_HEAL_VOLUME, we fail the
request.

Change-Id: If6cc07bc5455c4ba03458a36c28b63664496b17d
fixes: bz#1596513
Signed-off-by: Ravishankar N <ravishankar at redhat.com>
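
For illustration, here is a minimal sketch of the guard this fix describes,
again with simplified, hypothetical stand-in types (the actual patch in
glusterfsd-mgmt.c may differ in details such as error reporting):

#include <stdio.h>

/* Simplified, hypothetical stand-ins for the real gluster types. */
typedef struct xlator { const char *name; } xlator_t;
typedef struct { xlator_t *first; } glusterfs_graph_t;
typedef struct { glusterfs_graph_t *active; } glusterfs_ctx_t;

static glusterfs_ctx_t *glusterfsd_ctx;

/* Mirrors the shape of glusterfs_handle_translator_op(): fail the
 * request instead of crashing when the graph is not yet active. */
static int handle_translator_op(void)
{
        glusterfs_graph_t *active = glusterfsd_ctx->active;

        if (!active) {
                fprintf(stderr, "graph not yet active; rejecting brick-op\n");
                return -1;                /* request fails, shd survives */
        }
        printf("dispatching op to xlator %s\n", active->first->name);
        return 0;
}

int main(void)
{
        glusterfs_ctx_t ctx = { .active = NULL };
        glusterfsd_ctx = &ctx;

        handle_translator_op();           /* before graph init: rejected */

        xlator_t top = { .name = "glustershd" };
        glusterfs_graph_t graph = { .first = &top };
        ctx.active = &graph;              /* graph is now active */

        handle_translator_op();           /* after graph init: proceeds */
        return 0;
}

With this guard in place, a heal launched while shd's graph is still coming
up simply fails and can be retried from the CLI, instead of taking shd down.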


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1460245
[Bug 1460245] [GSS]Glustershd process crashes intermittently on SSL enabled
volume in RHGS 3.2
https://bugzilla.redhat.com/show_bug.cgi?id=1593865
[Bug 1593865] shd crash on startup
https://bugzilla.redhat.com/show_bug.cgi?id=1596513
[Bug 1596513] glustershd crashes when index heal is launched before graph
is initialized.