[Bugs] [Bug 1434399] New: glusterd crashes when peering an IP where the address is more than acceptable range (>255) OR with random hostnames

bugzilla at redhat.com bugzilla at redhat.com
Tue Mar 21 12:33:07 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1434399

            Bug ID: 1434399
           Summary: glusterd crashes when peering an IP where the address
                    is more than acceptable range (>255) OR with random
                    hostnames
           Product: GlusterFS
           Version: 3.10
         Component: glusterd
          Keywords: Triaged
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: amukherj at redhat.com
                CC: amukherj at redhat.com, asoman at redhat.com,
                    bsrirama at redhat.com, bugs at gluster.org,
                    mchangir at redhat.com, nchilaka at redhat.com,
                    rcyriac at redhat.com, rhs-bugs at redhat.com,
                    storage-qa-internal at redhat.com, vbellur at redhat.com
        Depends On: 1433578
            Blocks: 1433276



+++ This bug was initially created as a clone of Bug #1433578 +++

+++ This bug was initially created as a clone of Bug #1433276 +++

Description of problem:
==============
When we peer probe a node whose IP address has an octet greater than 255,
glusterd crashes consistently (at least 95% of the time; checked on 5
different setups).
Issue a gluster peer probe 10.70.35.1221 ===> note that the last part has 4
digits
glusterd crashes

This is consistent and can easily happen if the admin makes a typo,
which is quite possible.


Checked on 3.1.3 (3.7.9-10): I couldn't reproduce it.
On 3.8.4-18, any octet above 255 makes glusterd crash.


Core details:
[root at dhcp35-138 ~]# file /core.30402 
/core.30402: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style,
from '/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO', real uid:
0, effective uid: 0, real gid: 0, effective gid: 0, execfn:
'/usr/sbin/glusterd', platform: 'x86_64'
[root at dhcp35-138 ~]# gdb /usr/sbin/glusterd /core.30402
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from
/usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.

warning: core file may not match specified executable file.
[New LWP 29703]
[New LWP 30405]
[New LWP 30403]
[New LWP 30404]
[New LWP 30406]
[New LWP 30402]
[New LWP 30607]
[New LWP 30608]
[New LWP 29704]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level
INFO'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fd5da47aea5 in __gf_free (free_ptr=0x7fd5b5620040) at mem-pool.c:314
314            GF_ASSERT (GF_MEM_TRAILER_MAGIC ==
Missing separate debuginfos, use: debuginfo-install
bzip2-libs-1.0.6-13.el7.x86_64
device-mapper-event-libs-1.02.135-1.el7_3.3.x86_64
device-mapper-libs-1.02.135-1.el7_3.3.x86_64 elfutils-libelf-0.166-2.el7.x86_64
elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64
libattr-2.4.46-12.el7.x86_64 libblkid-2.23.2-33.el7.x86_64
libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64
libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64
libsepol-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64
libxml2-2.9.1-6.el7_2.3.x86_64 lvm2-libs-2.02.166-1.el7_3.3.x86_64
openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64
systemd-libs-219-30.el7_3.7.x86_64 userspace-rcu-0.7.9-2.el7rhgs.x86_64
xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007fd5da47aea5 in __gf_free (free_ptr=0x7fd5b5620040) at mem-pool.c:314
#1  0x00007fd5da21c9e7 in saved_frames_destroy (frames=<optimized out>) at
rpc-clnt.c:388
#2  0x00007fd5da21e140 in rpc_clnt_connection_cleanup
(conn=conn at entry=0x7fd5b53a4390) at rpc-clnt.c:557
#3  0x00007fd5da21ec00 in rpc_clnt_handle_disconnect (conn=0x7fd5b53a4390,
clnt=0x7fd5b53a4360) at rpc-clnt.c:900
#4  rpc_clnt_notify (trans=<optimized out>, mydata=0x7fd5b53a4390,
event=<optimized out>, data=0x7fd5b5610f30)
    at rpc-clnt.c:953
#5  0x00007fd5da21a9f3 in rpc_transport_notify (this=<optimized out>,
event=event at entry=RPC_TRANSPORT_DISCONNECT, 
    data=<optimized out>) at rpc-transport.c:538
#6  0x00007fd5cc032b2d in socket_connect_error_cbk (opaque=0x7fd5b55b2070) at
socket.c:2927
#7  0x00007fd5d92b5dc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fd5d8bfa73d in clone () from /lib64/libc.so.6
(gdb) 

Version-Release number of selected component (if applicable):
===
3.8.4-18

How reproducible:
====
Always (roughly 95% of the time)

Steps to Reproduce:
1. Set up a gluster node.
2. Issue a peer probe to, say, 10.70.35.x (where x is > 255).
3. glusterd crashes.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-03-17
05:52:01 EDT ---

This bug is automatically being proposed for the current release of Red Hat
Gluster Storage 3 under active development, by setting the release flag
'rhgs-3.3.0' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Ambarish on 2017-03-17 05:59:36 EDT ---

I hit this on my setup as well just now.

[root at localhost bricks]# gluster peer probe 10.70.37.12345
peer probe: failed: Probe returned with Transport endpoint is not connected
[root at localhost bricks]# 


The weird thing is that I see this file getting created with the wrong/random
hostname:

[root at localhost peers]# ll -h /var/lib/glusterd/peers/
total 12K
-rw-------. 1 root root 73 Mar 17 05:52 02ef4e27-a38e-4e1e-8b75-a0657c2eae6b
-rw-------. 1 root root 75 Mar 17 05:52 10.70.37.12345     -----> BAD
-rw-------. 1 root root 94 Mar 17 05:52 f6384f3a-ab69-4757-8fc8-eda43bd17c2e
[root at localhost peers]# 


[root at localhost peers]# cat 10.70.37.12345 
uuid=00000000-0000-0000-0000-000000000000
state=0
hostname1=10.70.37.12345
[root at localhost peers]# 


Peer status fails on the crashed node as well:

[root at localhost peers]# gluster peer status
peer status: failed
[root at localhost peers]# 



It works fine on the other nodes, though:

[root at localhost /]# gluster peer status
Number of Peers: 2

Hostname: 10.70.37.65
Uuid: 32095651-cbda-40e8-941c-6b75c260610e
State: Peer in Cluster (Connected)

Hostname: 10.70.37.116
Uuid: 02ef4e27-a38e-4e1e-8b75-a0657c2eae6b
State: Peer in Cluster (Connected)
[root at localhost /]#

--- Additional comment from Ambarish on 2017-03-17 06:03:30 EDT ---

The issue is reproducible if I give peer probe "abcd" as well.

Samikshan shared a similar upstream BZ,
https://bugzilla.redhat.com/show_bug.cgi?id=770048, which was later closed as
WFM because no one could reproduce it.

But it is very consistent now.

--- Additional comment from Atin Mukherjee on 2017-03-17 11:39:12 EDT ---

https://review.gluster.org/#/c/15916 has caused this regression, further
analysis to follow on.

--- Additional comment from Atin Mukherjee on 2017-03-17 11:47:13 EDT ---

(In reply to Atin Mukherjee from comment #4)
> https://review.gluster.org/#/c/15916 has caused this regression, further
> analysis to follow on.

Ignore this. That patch does not look like the culprit.

--- Additional comment from Milind Changire on 2017-03-17 15:31:29 EDT ---

When the erroneous IP address fails the valid_ipv4_address() test, it still
passes the valid_host_name() test, so the mistyped IP address is treated as a
dotted FQDN and handed over to glusterd for processing.

We could mitigate this problem of erroneous input forwarding by ensuring that
the host name resolves to a valid IP Address in the cli before passing the host
name to glusterd.
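
To illustrate the validation gap described above, here is a minimal Python
sketch. It is not the glusterfs CLI code: the helper names looks_like_ipv4,
looks_like_hostname, classify and resolves are hypothetical, and the hostname
rule is assumed to be "dot-separated RFC 1123 labels". Under that assumption,
an address with an out-of-range octet is rejected as IPv4 but accepted as a
hostname, and only a resolution check would stop it in the cli.

```python
import ipaddress
import re
import socket

# One RFC 1123 label: letters, digits and hyphens, no leading/trailing hyphen.
# (Assumed rule for this sketch; the real valid_host_name() may differ.)
_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def looks_like_ipv4(text):
    """True only for a well-formed dotted quad with every octet in 0..255."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False


def looks_like_hostname(text):
    """True if every dot-separated label is a syntactically valid label."""
    if not text or len(text) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in text.split("."))


def classify(text):
    """Mimic the order of checks: IPv4 first, then hostname."""
    if looks_like_ipv4(text):
        return "ipv4"
    if looks_like_hostname(text):
        return "hostname"
    return "invalid"


def resolves(text):
    """Proposed extra check: does the name resolve to any address?"""
    try:
        socket.getaddrinfo(text, None)
        return True
    except socket.gaierror:
        return False
```

With these helpers, classify("10.70.35.1221") and classify("abcd") both return
"hostname", so the input would be forwarded unless something like resolves()
is also consulted before the request is sent to glusterd.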

However, we still need to RCA the assertion failure in saved_frames_destroy().

I wonder whether the same result can be seen on a ping-timer expiry when FOP
processing is held up for a long time in a gdb debug session on another node
to simulate a busy brick.

--- Additional comment from Worker Ant on 2017-03-18 07:06:53 EDT ---

REVIEW: https://review.gluster.org/16914 (rpc: bump up conn->cleanup_gen in
rpc_clnt_reconnect_cleanup) posted (#1) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Worker Ant on 2017-03-18 08:03:06 EDT ---

REVIEW: https://review.gluster.org/16914 (rpc: bump up conn->cleanup_gen in
rpc_clnt_reconnect_cleanup) posted (#2) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Worker Ant on 2017-03-20 19:34:20 EDT ---

COMMIT: https://review.gluster.org/16914 committed in master by Jeff Darcy
(jeff at pl.atyp.us) 
------
commit 39e09ad1e0e93f08153688c31433c38529f93716
Author: Atin Mukherjee <amukherj at redhat.com>
Date:   Sat Mar 18 16:29:10 2017 +0530

    rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup

    Commit 086436a introduced generation number (cleanup_gen) to ensure that
    rpc layer doesn't end up cleaning up the connection object if
    application layer has already destroyed it. Bumping up cleanup_gen was
    done only in rpc_clnt_connection_cleanup (). However the same is needed
    in rpc_clnt_reconnect_cleanup () too as with out it if the object gets
    destroyed through the reconnect event in the application layer, rpc
    layer will still end up in trying to delete the object resulting into
    double free and crash.

    Peer probing an invalid host/IP was the basic test to catch this issue.

    Change-Id: Id5332f3239cb324cead34eb51cf73d426733bd46
    BUG: 1433578
    Signed-off-by: Atin Mukherjee <amukherj at redhat.com>
    Reviewed-on: https://review.gluster.org/16914
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    Reviewed-by: Milind Changire <mchangir at redhat.com>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Jeff Darcy <jeff at pl.atyp.us>
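
The generation-number idea in the commit message above can be modelled with a
small Python sketch. This is a simplified illustration, not the rpc-clnt.c
code: the Connection class, the releases counter, deferred_cleanup and
simulate are hypothetical names, and the bump_gen flag only exists to compare
the behaviour with and without the fix. The assumption is that a deferred
cleanup remembers the generation it saw when it was scheduled and skips its
work if the generation has changed since.

```python
class Connection:
    """Toy stand-in for a connection object guarded by a generation counter."""

    def __init__(self):
        self.cleanup_gen = 0  # bumped every time a cleanup path runs
        self.releases = 0     # how many times the frames were released

    def connection_cleanup(self):
        # This path always bumps the generation.
        self.cleanup_gen += 1
        self.releases += 1

    def reconnect_cleanup(self, bump_gen):
        # bump_gen=False models the behaviour before the fix,
        # bump_gen=True models the behaviour after it.
        if bump_gen:
            self.cleanup_gen += 1
        self.releases += 1


def deferred_cleanup(conn, gen_seen):
    """Run a cleanup scheduled earlier, unless someone cleaned up meanwhile."""
    if conn.cleanup_gen != gen_seen:
        return False  # stale request: the object was already cleaned up
    conn.connection_cleanup()
    return True


def simulate(bump_gen):
    conn = Connection()
    gen_seen = conn.cleanup_gen       # lower layer records the generation
    conn.reconnect_cleanup(bump_gen)  # application path cleans up first
    deferred_cleanup(conn, gen_seen)  # lower layer then tries its cleanup
    return conn.releases
```

In this model simulate(False) returns 2, meaning the frames are released
twice (the analogue of the double free), while simulate(True) returns 1
because the stale deferred cleanup is skipped.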


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1433276
[Bug 1433276] glusterd crashes when peering an IP where the address is more
than acceptable range (>255) OR with random hostnames
https://bugzilla.redhat.com/show_bug.cgi?id=1433578
[Bug 1433578] glusterd crashes when peering an IP where the address is more
than acceptable range (>255) OR with random hostnames