[Bugs] [Bug 1320374] New: Glusterd crashed just after a peer probe command failed.

bugzilla at redhat.com bugzilla at redhat.com
Wed Mar 23 04:38:33 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1320374

            Bug ID: 1320374
           Summary: Glusterd crashed just after a peer probe command
                    failed.
           Product: GlusterFS
           Version: 3.7.9
         Component: glusterd
          Assignee: bugs at gluster.org
          Reporter: amukherj at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    rkavunga at redhat.com, sasundar at redhat.com
        Depends On: 1318546



+++ This bug was initially created as a clone of Bug #1318546 +++

Description of problem:

If a peer probe command fails because of an unresolvable IP and a gluster
volume stop command is run immediately afterwards, the two together cause
glusterd to crash.



Version-Release number of selected component (if applicable):

mainline.

How reproducible:
50%

Steps to Reproduce:
1. Create a volume.
2. Do a peer probe on an invalid IP (e.g. a.b.c.d).
3. Stop the volume.
4. Alternatively, run the test ./tests/bugs/glusterfs/bug-879490.t from the
   gluster source.

Actual results:

Glusterd crashed

Expected results:

Glusterd should not crash

Additional info:

--- Additional comment from Mohammed Rafi KC on 2016-03-17 04:24:06 EDT ---

(gdb) bt
#0  0x00007f5c0e5c9fc6 in dict_lookup_common (this=0x7f5bec004e3c,
key=0x7f5c04d56970 "cmd-str") at dict.c:287
#1  0x00007f5c0e5cc6d1 in dict_get_with_ref (this=0x7f5bec004e3c,
key=0x7f5c04d56970 "cmd-str", data=0x7f5c01c5f1c0) at dict.c:1397
#2  0x00007f5c0e5cdae2 in dict_get_str (this=0x7f5bec004e3c, key=0x7f5c04d56970
"cmd-str", str=0x7f5c01c5f238) at dict.c:2139
#3  0x00007f5c04c4f901 in glusterd_xfer_cli_probe_resp (req=0x7f5bf800093c,
op_ret=-1, op_errno=107, op_errstr=0x0, hostname=0x7f5bec09ca70 "a.b.c.d",
port=24007, dict=0x7f5bec004e3c)
    at glusterd-handler.c:3944
#4  0x00007f5c04c53107 in glusterd_friend_remove_notify
(peerctx=0x7f5bec003de0, op_errno=107) at glusterd-handler.c:5080
#5  0x00007f5c04c537c5 in __glusterd_peer_rpc_notify (rpc=0x7f5bec004310,
mydata=0x7f5bec003de0, event=RPC_CLNT_DISCONNECT, data=0x0) at
glusterd-handler.c:5210
#6  0x00007f5c04c44238 in cds_list_add_tail_rcu (newp=0x7f5bec004310,
head=0x7f5bec003de0) at ../../../../contrib/userspace-rcu/rculist-extra.h:36
#7  0x00007f5c04c538aa in __glusterd_peer_rpc_notify (rpc=0x7f5c01c60700,
mydata=0x7f5c01c60700, event=RPC_CLNT_CONNECT, data=0x1c5fbf0) at
glusterd-handler.c:5234
#8  0x00007f5c0e39c5a4 in rpc_clnt_notify (trans=0x7f5bec09e520,
mydata=0x7f5bec004340, event=RPC_TRANSPORT_DISCONNECT, data=0x7f5bec09e520) at
rpc-clnt.c:867
#9  0x00007f5c0e398ac9 in rpc_transport_notify (this=0x7f5bec09e520,
event=RPC_TRANSPORT_DISCONNECT, data=0x7f5bec09e520) at rpc-transport.c:541
#10 0x00007f5c03cb34f6 in socket_connect_error_cbk (opaque=0x7f5bf0000b90) at
socket.c:2814
#11 0x0000003400607ee5 in start_thread (arg=0x7f5c01c60700) at
pthread_create.c:309
#12 0x00000034002f4d1d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

--- Additional comment from Mohammed Rafi KC on 2016-03-17 04:29:44 EDT ---

glusterd_friend_remove_notify was called twice, which resulted in accessing an
already freed dictionary stored in peerctx.args.

I guess glusterd_friend_remove_notify was called the second time as part of the
automatic timer-based reconnect logic. Since the reconnect will always fail,
this can cause a crash.
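
The double-call hazard described above can be modelled in isolation. The sketch below is only an illustration, not glusterd code: demo_dict, demo_peerctx and demo_remove_notify are hypothetical stand-ins for the real dictionary, peer context and notify handler. It shows one common defensive pattern, assuming single-threaded callers: detach the stored pointer before releasing it, so that a second invocation finds nothing to touch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; these are NOT the glusterd types. */
struct demo_dict {
    int refcount;
};

struct demo_peerctx {
    struct demo_dict *args_dict; /* owned by the context until consumed */
    int responses_sent;          /* how many times a response was produced */
};

static struct demo_dict *demo_dict_new(void)
{
    struct demo_dict *d = calloc(1, sizeof(*d));
    if (d)
        d->refcount = 1;
    return d;
}

static void demo_dict_unref(struct demo_dict *d)
{
    if (d && --d->refcount == 0)
        free(d);
}

/*
 * Model of a "remove notify" handler that may be invoked more than once.
 * Returns 1 if it sent a response, 0 if the work was already done.
 */
static int demo_remove_notify(struct demo_peerctx *ctx)
{
    assert(ctx != NULL);

    if (ctx->args_dict == NULL)
        return 0; /* second call: the dict was already consumed */

    struct demo_dict *d = ctx->args_dict;
    ctx->args_dict = NULL; /* detach first, so no stale pointer remains */

    ctx->responses_sent++; /* stands in for sending the CLI response */
    demo_dict_unref(d);
    return 1;
}
```

With this shape, calling demo_remove_notify twice on the same context sends one response and the second call returns 0 instead of dereferencing freed memory. The sketch ignores locking; a handler reachable from several threads would need additional synchronization.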

--- Additional comment from Vijay Bellur on 2016-03-20 09:35:39 EDT ---

REVIEW: http://review.gluster.org/13790 (glusterd/rpc : Discard duplicate
Disconnect events) posted (#1) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Mohammed Rafi KC on 2016-03-22 06:32:08 EDT ---

The newly created rpc for the friend will be cleared through friend_sm. This
crash can be hit only if friend_sm takes longer than the reconnect.

--- Additional comment from Vijay Bellur on 2016-03-22 15:25:04 EDT ---

COMMIT: http://review.gluster.org/13790 committed in master by Jeff Darcy
(jdarcy at redhat.com) 
------
commit 1081584d4c2d26e56fea623ecfadd305c6e3d3bc
Author: Atin Mukherjee <amukherj at redhat.com>
Date:   Sun Mar 20 18:31:00 2016 +0530

    glusterd/rpc : Discard duplicate Disconnect events

    If a peer rpc disconnect event has been already processed, skip the
    furthers as processing them are overheads and sometimes may lead to a
    crash like due to a double free

    Change-Id: Iec589ce85daf28fd5b267cb6fc82a4238e0e8adc
    BUG: 1318546
    Signed-off-by: Atin Mukherjee <amukherj at redhat.com>
    Reviewed-on: http://review.gluster.org/13790
    Smoke: Gluster Build System <jenkins at build.gluster.com>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Jeff Darcy <jdarcy at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1318546
[Bug 1318546] Glusterd crashed just after a peer probe command failed.

