[Bugs] [Bug 1315626] New: glusterd crashed when probing a node with firewall enabled on only one node

Tue Mar 8 09:26:14 UTC 2016

https://bugzilla.redhat.com/show_bug.cgi?id=1315626

            Bug ID: 1315626
           Summary: glusterd crashed when probing a node with firewall
                    enabled on only one node
           Product: GlusterFS
           Version: 3.7.0
         Component: glusterd
          Keywords: Triaged
          Severity: high
          Priority: high
          Assignee: bugs at gluster.org
          Reporter: ggarg at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    mselvaga at redhat.com, sasundar at redhat.com
        Depends On: 1310677
            Blocks: 1314391

+++ This bug was initially created as a clone of Bug #1310677 +++

Description of problem:
-----------------------
glusterd crashed when probing a node with a firewall enabled on only one of
the nodes

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHEL 7.2
glusterfs-3.7.6

How reproducible:
-----------------
Haven't tried to reproduce

Steps to Reproduce:
-------------------
1. Install RHEL 7.2 + glusterfs on 2 nodes (say node1 and node2)
2. Add a firewall rule to open glusterd port 24007 on only one node (say node2)
3. Probe a peer - node2 - from node1
4. Probe a peer - node1 - from node2

Actual results:
---------------
glusterd crashed on node2

Expected results:
-----------------
glusterd should not crash

--- Additional comment from SATHEESARAN on 2016-02-22 08:50:37 EST ---

[root@node2 ~]# gdb -c /core.9717
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
[New LWP 9720]
[New LWP 9725]
[New LWP 9718]
[New LWP 9726]
[New LWP 9719]
[New LWP 9732]
[New LWP 9724]
[New LWP 9723]
[New LWP 9717]
[New LWP 9721]

warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from
/usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install
/usr/lib/debug/.build-id/17/a121b1f7bbb010f54735ffde3347b27b33884d
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
24    1:    LOCK
(gdb) bt
#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
#1  0x00007fdd8397a45d in __gf_free (free_ptr=0x7fdd6c000cb0) at mem-pool.c:316
#2  0x00007fdd8393ee55 in data_destroy (data=<optimized out>) at dict.c:235
#3  0x00007fdd83941b79 in dict_get_str (this=<optimized out>, key=<optimized out>, str=<optimized out>)
    at dict.c:2213
#4  0x00007fdd784adce9 in glusterd_xfer_cli_probe_resp (req=req@entry=0x7fdd85c6811c, op_ret=op_ret@entry=-1,
    op_errno=0, op_errstr=op_errstr@entry=0x0, hostname=0x7fdd6c000d80 "dhcp37-152", port=24007,
    dict=0x7fdd83c17be4) at glusterd-handler.c:3894
#5  0x00007fdd784aea57 in __glusterd_handle_cli_probe (req=req@entry=0x7fdd85c6811c) at glusterd-handler.c:1220
#6  0x00007fdd784a7540 in glusterd_big_locked_handler (req=0x7fdd85c6811c,
    actor_fn=0x7fdd784ae590 <__glusterd_handle_cli_probe>) at glusterd-handler.c:83
#7  0x00007fdd83988e32 in synctask_wrap (old_task=<optimized out>) at syncop.c:380
#8  0x00007fdd82047110 in ?? () from /usr/lib64/libc-2.17.so
#9  0x0000000000000000 in ?? ()

--- Additional comment from SATHEESARAN on 2016-02-22 09:00:42 EST ---

Crash messages as seen in the glusterd logs:

<snip>
The message "I [MSGID: 106004]
[glusterd-handler.c:5065:__glusterd_peer_rpc_notify] 0-management: Peer
<dhcp37-152.lab.eng.blr.redhat.com> (<4d46cc7a-6d17-460e-82ba-7f5624436fb0>),
in state <Accepted peer request>, has disconnected from glusterd." repeated 4
times between [2016-02-22 15:50:38.204058] and [2016-02-22 15:50:50.235773]
[2016-02-22 15:50:51.106009] I [MSGID: 106487]
[glusterd-handler.c:1178:__glusterd_handle_cli_probe] 0-glusterd: Received CLI
probe req dhcp37-152 24007
The message "I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop]
0-management:  already stopped" repeated 4 times between [2016-02-22
15:50:16.093916] and [2016-02-22 15:50:16.093939]
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2016-02-22 15:50:51
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.6
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7fdd83947012]
/lib64/libglusterfs.so.0(gf_print_trace+0x31d)[0x7fdd839634dd]
/lib64/libc.so.6(+0x35670)[0x7fdd82035670]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x7fdd827b4210]
---------

</snip>

--- Additional comment from SATHEESARAN on 2016-02-22 09:08:02 EST ---

Console output from node1
-------------------------
[root@node1 ~]# gluster peer probe node2
peer probe: success.

[root@node1 ~]# gluster peer status
Number of Peers: 1

Hostname: node2
Uuid: df339e12-c30f-4a86-9977-ef4ac6d5a190
State: Accepted peer request (Connected)

Console output from node2
-------------------------

[root@node2 ~]# gluster peer status
Number of Peers: 1

Hostname: node1
Uuid: 4d46cc7a-6d17-460e-82ba-7f5624436fb0
State: Accepted peer request (Disconnected)

[root@node2 ~]# gluster peer probe node1
peer probe: success. Host dhcp37-152 port 24007 already in peer list

[root@node2 ~]# gluster peer status
Connection failed. Please check if gluster daemon is operational.
peer status: failed

--- Additional comment from SATHEESARAN on 2016-02-22 09:13:59 EST ---

I could hit this issue consistently

--- Additional comment from Gaurav Kumar Garg on 2016-02-29 05:42:49 EST ---

upstream patch for this bug is available: http://review.gluster.org/#/c/13546/

--- Additional comment from Vijay Bellur on 2016-03-01 00:59:35 EST ---

REVIEW: http://review.gluster.org/13546 (glusterd: glusterd was crashing when
peer probing of disconnect node of cluster) posted (#2) for review on master by
Gaurav Kumar Garg (ggarg at redhat.com)

--- Additional comment from Vijay Bellur on 2016-03-07 01:30:44 EST ---

REVIEW: http://review.gluster.org/13546 (glusterd: upon peer probe glusterd
should not return response to CLI two times) posted (#3) for review on master
by Gaurav Kumar Garg (ggarg at redhat.com)

--- Additional comment from Vijay Bellur on 2016-03-07 01:33:57 EST ---

REVIEW: http://review.gluster.org/13546 (glusterd:upon re-peer probe glusterd
should not return response to CLI two times) posted (#4) for review on master
by Gaurav Kumar Garg (ggarg at redhat.com)

--- Additional comment from Vijay Bellur on 2016-03-07 01:46:16 EST ---

REVIEW: http://review.gluster.org/13546 (glusterd:upon re-peer probe glusterd
should not return response to CLI two times) posted (#5) for review on master
by Atin Mukherjee (amukherj at redhat.com)

--- Additional comment from Vijay Bellur on 2016-03-07 23:08:27 EST ---

REVIEW: http://review.gluster.org/13546 (glusterd:upon re-peer probe glusterd
should not return response to CLI two times) posted (#6) for review on master
by Gaurav Kumar Garg (ggarg at redhat.com)

--- Additional comment from Vijay Bellur on 2016-03-08 04:10:36 EST ---

COMMIT: http://review.gluster.org/13546 committed in master by Atin Mukherjee
(amukherj at redhat.com) 
------
commit f44232e6a18a4b79e680ea0b6322269b84fa6813
Author: Gaurav Kumar Garg <garg.gaurav52 at gmail.com>
Date:   Mon Feb 29 15:48:58 2016 +0530

    glusterd:upon re-peer probe glusterd should not return response to CLI two times

    If nodes N1 and N2 are part of the cluster and N2 tries to re-probe N1
    while N1 is disconnected for any reason (e.g. the server is down,
    glusterd is not running, there is a network outage, or a firewall is
    blocking port 24007, on which glusterd listens), glusterd tries to send
    two responses back to the CLI, resulting in a double free and a glusterd
    crash.

    With this fix, glusterd sends the response to the CLI only once, which
    prevents the crash.

    Note: glusterd crashed only when the user did the first peer probe with
    the hostname and the re-probe with the IP address, or vice versa.

    Change-Id: I92012b147091cf9129f1fbc17834b3f4d7cb46a0
    BUG: 1310677
    Signed-off-by: Gaurav Kumar Garg <ggarg at redhat.com>
    Reviewed-on: http://review.gluster.org/13546
    Smoke: Gluster Build System <jenkins at build.gluster.com>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Atin Mukherjee <amukherj at redhat.com>
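
For illustration only: the failure mode described in the commit message
(answering the same CLI request twice, so the same buffer is freed twice)
can be sketched in a few lines of C. This is not the glusterd code or the
actual patch. The type cli_req, the functions cli_req_new, cli_req_respond
and cli_req_destroy, and the "responded" flag are all hypothetical names
chosen for this sketch. It only shows one common way to make a "send
response" path safe to call more than once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a CLI request; NOT a glusterd type. */
struct cli_req {
    char *payload;    /* heap buffer released when the response is sent */
    bool  responded;  /* set once a response has gone out */
    int   sent_count; /* number of responses actually sent */
};

/* Allocate a request holding a copy of payload. Returns NULL on failure. */
struct cli_req *cli_req_new(const char *payload)
{
    struct cli_req *req = calloc(1, sizeof(*req));
    if (req == NULL)
        return NULL;

    size_t len = strlen(payload) + 1;
    req->payload = malloc(len);
    if (req->payload == NULL) {
        free(req);
        return NULL;
    }
    memcpy(req->payload, payload, len);
    return req;
}

/*
 * Send a response at most once.
 * Returns 0 if the response was sent, -1 if one had already been sent.
 */
int cli_req_respond(struct cli_req *req)
{
    assert(req != NULL);

    if (req->responded)
        return -1;           /* second caller: nothing to send or free */

    req->responded = true;
    req->sent_count++;
    free(req->payload);      /* payload is released exactly once */
    req->payload = NULL;     /* a stray later free() becomes a no-op */
    return 0;
}

/* Release the request itself; safe whether or not a response was sent. */
void cli_req_destroy(struct cli_req *req)
{
    if (req == NULL)
        return;
    free(req->payload);      /* free(NULL) is defined to do nothing */
    free(req);
}
```

With this guard, a second call to cli_req_respond on the same request
returns -1 without touching the already-freed payload, instead of freeing
it again.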

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1310677
[Bug 1310677] glusterd crashed when probing a node with firewall enabled on
only one node
https://bugzilla.redhat.com/show_bug.cgi?id=1314391
[Bug 1314391] glusterd crashed when probing a node with firewall enabled on
only one node