[Bugs] [Bug 1790670] New: glusterd crashed when trying to add node

bugzilla at redhat.com
Mon Jan 13 21:14:23 UTC 2020


https://bugzilla.redhat.com/show_bug.cgi?id=1790670

            Bug ID: 1790670
           Summary: glusterd crashed when trying to add node
           Product: GlusterFS
           Version: 7
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: glusterd
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: pizzi at leopardus.com
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Description of problem:

Trying to add a node to an existing cluster causes glusterd on the new node to
crash when a peer probe is issued from any of the nodes of the existing
cluster.

I tried several times, from all nodes, using both the public and the private
interface address for the probe, to no avail.
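
For reference, the probe was issued roughly like this (the node name and
address below are placeholders, not the real ones):

gluster peer probe newnode.example.com   # using the public name
gluster peer probe 10.0.0.4              # using the private interface address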

The join actually seems to succeed (peer status on the other nodes says it
joined), but since glusterd crashes, something is left corrupted and glusterd
will not start on the next execution.
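
The join state was checked from one of the surviving nodes with the standard
CLI command:

gluster peer status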

Log of crashing glusterd:

[2020-01-13 20:45:25.057344] I [glusterd.c:1998:init] 0-management:
Regenerating volfiles due to a max op-version mismatch or glusterd.upgrade file
not being present, op_version retrieved:0, max op_version: 70000
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option rpc-auth-allow-insecure on
  7:     option transport.listen-backlog 1024
  8:     option max-port 60999
  9:     option event-threads 1
 10:     option ping-timeout 0
 11:     option transport.rdma.listen-port 24008
 12:     option transport.socket.listen-port 24007
 13:     option transport.socket.read-fail-log off
 14:     option transport.socket.keepalive-interval 2
 15:     option transport.socket.keepalive-time 10
 16:     option transport-type rdma
 17:     option working-directory /var/lib/glusterd
 18: end-volume
 19:  
+------------------------------------------------------------------------------+
[2020-01-13 20:45:25.069364] I [MSGID: 101190]
[event-epoll.c:682:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 0 
[2020-01-13 20:45:27.338940] I [MSGID: 106487]
[glusterd-handler.c:1339:__glusterd_handle_cli_list_friends] 0-glusterd:
Received cli list req 
[2020-01-13 20:45:53.609592] I [MSGID: 106163]
[glusterd-handshake.c:1433:__glusterd_mgmt_hndsk_versions_ack] 0-management:
using the op-version 40100 
[2020-01-13 20:45:53.609631] E [MSGID: 101032]
[store.c:493:gf_store_handle_retrieve] 0-: Path corresponding to
/var/lib/glusterd/glusterd.info. [No such file or directory]
[2020-01-13 20:45:53.609666] I [MSGID: 106477]
[glusterd.c:182:glusterd_uuid_generate_save] 0-management: generated UUID:
a4bc89ed-5100-4c82-942c-2e23126d8fef 
[2020-01-13 20:45:53.668763] I [MSGID: 106490]
[glusterd-handler.c:2789:__glusterd_handle_probe_query] 0-glusterd: Received
probe from uuid: 092a6cb9-b90d-4f21-a51d-c74a543e9dd8 
[2020-01-13 20:45:53.670396] I [MSGID: 106128]
[glusterd-handler.c:2824:__glusterd_handle_probe_query] 0-glusterd: Unable to
find peerinfo for host:
185-52-0-8.glusterfs-dynamic-98b69362-076a-11e9-b16e-00163cd18fae.default.svc.cluster.local
(24007) 
[2020-01-13 20:45:53.677593] W [MSGID: 106061]
[glusterd-handler.c:3315:glusterd_transport_inet_options_build] 0-glusterd:
Failed to get tcp-user-timeout 
[2020-01-13 20:45:53.677635] I [rpc-clnt.c:1014:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2020-01-13 20:45:53.680479] I [MSGID: 106498]
[glusterd-handler.c:3470:glusterd_friend_add] 0-management: connect returned 0 
[2020-01-13 20:45:53.680536] I [MSGID: 106493]
[glusterd-handler.c:2850:__glusterd_handle_probe_query] 0-glusterd: Responded
to
185-52-0-8.glusterfs-dynamic-98b69362-076a-11e9-b16e-00163cd18fae.default.svc.cluster.local,
op_ret: 0, op_errno: 0, ret: 0 
[2020-01-13 20:45:53.681379] I [MSGID: 106490]
[glusterd-handler.c:2434:__glusterd_handle_incoming_friend_req] 0-glusterd:
Received probe from uuid: 092a6cb9-b90d-4f21-a51d-c74a543e9dd8 
[2020-01-13 20:45:53.689395] I [MSGID: 106493]
[glusterd-handler.c:3715:glusterd_xfer_friend_add_resp] 0-glusterd: Responded
to
185-52-0-8.glusterfs-dynamic-98b69362-076a-11e9-b16e-00163cd18fae.default.svc.cluster.local
(0), ret: 0, op_ret: 0 
[2020-01-13 20:45:53.723341] I [MSGID: 106511]
[glusterd-rpc-ops.c:250:__glusterd_probe_cbk] 0-management: Received probe resp
from uuid: 092a6cb9-b90d-4f21-a51d-c74a543e9dd8, host:
185-52-0-8.glusterfs-dynamic-98b69362-076a-11e9-b16e-00163cd18fae.default.svc.cluster.local 
[2020-01-13 20:45:53.723376] I [MSGID: 106511]
[glusterd-rpc-ops.c:403:__glusterd_probe_cbk] 0-glusterd: Received resp to
probe req 
[2020-01-13 20:45:53.740791] I [MSGID: 106493]
[glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC
from uuid: 092a6cb9-b90d-4f21-a51d-c74a543e9dd8, host:
185-52-0-8.glusterfs-dynamic-98b69362-076a-11e9-b16e-00163cd18fae.default.svc.cluster.local,
port: 0 
[2020-01-13 20:45:53.746122] I [MSGID: 106492]
[glusterd-handler.c:2619:__glusterd_handle_friend_update] 0-glusterd: Received
friend update from uuid: 092a6cb9-b90d-4f21-a51d-c74a543e9dd8 
[2020-01-13 20:45:53.751125] W [MSGID: 106061]
[glusterd-handler.c:3315:glusterd_transport_inet_options_build] 0-glusterd:
Failed to get tcp-user-timeout 
[2020-01-13 20:45:53.751174] I [rpc-clnt.c:1014:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2020-01-13 20:45:53.753764] I [MSGID: 106498]
[glusterd-handler.c:3519:glusterd_friend_add_from_peerinfo] 0-management:
connect returned 0 
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2020-01-13 20:45:53
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 7.1

Sometimes a stack trace is printed:

/lib64/libglusterfs.so.0(+0x277ff)[0x7f19e70437ff]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f19e704e234]
/lib64/libc.so.6(+0x363b0)[0x7f19e56843b0]
/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x26c42)[0x7f19e134cc42]
/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x279ae)[0x7f19e134d9ae]
/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x2840f)[0x7f19e134e40f]
/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x2362e)[0x7f19e134962e]
/lib64/libgfrpc.so.0(+0x9695)[0x7f19e6de7695]
/lib64/libgfrpc.so.0(+0x9a0b)[0x7f19e6de7a0b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f19e6de9a93]
/usr/lib64/glusterfs/7.1/rpc-transport/socket.so(+0x4468)[0x7f19e0564468]
/usr/lib64/glusterfs/7.1/rpc-transport/socket.so(+0xb861)[0x7f19e056b861]
/lib64/libglusterfs.so.0(+0x8e246)[0x7f19e70aa246]
/lib64/libpthread.so.0(+0x7e65)[0x7f19e5e86e65]
/lib64/libc.so.6(clone+0x6d)[0x7f19e574c88d]
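
In case it helps with triage, those glusterd.so frame offsets should resolve
to function names with addr2line, assuming the matching debuginfo package is
installed (the offsets are copied from the trace above):

addr2line -f -e /usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so 0x26c42 0x279ae 0x2840f 0x2362e

or, if a core file was dumped (the core path here is a placeholder):

gdb -batch -ex 'bt full' /usr/sbin/glusterd /path/to/core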

If I restart glusterd, it aborts during initialization, apparently because the
stored peer data is now corrupted:

[2020-01-13 21:02:53.619924] I [MSGID: 100030] [glusterfsd.c:2867:main]
0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 7.1 (args:
/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO) 
[2020-01-13 21:02:53.620277] I [glusterfsd.c:2594:daemonize] 0-glusterfs: Pid
of current running process is 80
[2020-01-13 21:02:53.622166] I [MSGID: 106478] [glusterd.c:1426:init]
0-management: Maximum allowed open file descriptors set to 65536 
[2020-01-13 21:02:53.622193] I [MSGID: 106479] [glusterd.c:1482:init]
0-management: Using /var/lib/glusterd as working directory 
[2020-01-13 21:02:53.622201] I [MSGID: 106479] [glusterd.c:1488:init]
0-management: Using /var/run/gluster as pid file working directory 
[2020-01-13 21:02:53.625927] I [socket.c:1014:__socket_server_bind]
0-socket.management: process started listening on port (24007)
[2020-01-13 21:02:53.627074] W [MSGID: 103071]
[rdma.c:4472:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel
creation failed [No such device]
[2020-01-13 21:02:53.627092] W [MSGID: 103055] [rdma.c:4782:init]
0-rdma.management: Failed to initialize IB Device 
[2020-01-13 21:02:53.627101] W [rpc-transport.c:366:rpc_transport_load]
0-rpc-transport: 'rdma' initialization failed
[2020-01-13 21:02:53.627171] W [rpcsvc.c:1981:rpcsvc_create_listener]
0-rpc-service: cannot create listener, initing the transport failed
[2020-01-13 21:02:53.627179] E [MSGID: 106244] [glusterd.c:1781:init]
0-management: creation of 1 listeners failed, continuing with succeeded
transport 
[2020-01-13 21:02:53.628336] I [socket.c:957:__socket_server_bind]
0-socket.management: closing (AF_UNIX) reuse check socket 12
[2020-01-13 21:02:54.544119] I [MSGID: 106513]
[glusterd-store.c:2257:glusterd_restore_op_version] 0-glusterd: retrieved
op-version: 40100 
[2020-01-13 21:02:54.544330] I [MSGID: 106498]
[glusterd-handler.c:3519:glusterd_friend_add_from_peerinfo] 0-management:
connect returned 0 
[2020-01-13 21:02:54.544595] E
[glusterd-handler.c:3275:glusterd_transport_inet_options_build]
(-->/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x8aebe) [0x7fd0ac39eebe]
-->/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x26c7a) [0x7fd0ac33ac7a]
-->/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x26b56) [0x7fd0ac33ab56]
) 0-: Assertion failed: hostname 
[2020-01-13 21:02:54.544620] E
[rpc-transport.c:655:rpc_transport_inet_options_build]
(-->/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x26c7a) [0x7fd0ac33ac7a]
-->/usr/lib64/glusterfs/7.1/xlator/mgmt/glusterd.so(+0x267fc) [0x7fd0ac33a7fc]
-->/lib64/libgfrpc.so.0(rpc_transport_inet_options_build+0x2b6)
[0x7fd0b1dd8156] ) 0-: Assertion failed: hostname 
The message "I [MSGID: 106498]
[glusterd-handler.c:3519:glusterd_friend_add_from_peerinfo] 0-management:
connect returned 0" repeated 2 times between [2020-01-13 21:02:54.544330] and
[2020-01-13 21:02:54.544420]
[2020-01-13 21:02:54.544637] E [MSGID: 101019] [xlator.c:629:xlator_init]
0-management: Initialization of volume 'management' failed, review your volfile
again 
[2020-01-13 21:02:54.544649] E [MSGID: 101066]
[graph.c:425:glusterfs_graph_init] 0-management: initializing translator failed 
[2020-01-13 21:02:54.544654] E [MSGID: 101176]
[graph.c:779:glusterfs_graph_activate] 0-graph: init failed 
[2020-01-13 21:02:54.544729] W [glusterfsd.c:1596:cleanup_and_exit]
(-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x55b5b2bbc19d]
-->/usr/sbin/glusterd(glusterfs_process_volfp+0x21d) [0x55b5b2bbc08d]
-->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55b5b2bbb48b] ) 0-: received
signum (-1), shutting down 
[2020-01-13 21:02:54.544764] W [mgmt-pmap.c:132:rpc_clnt_mgmt_pmap_signout]
0-glusterfs: failed to create XDR payload
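
The "Assertion failed: hostname" above suggests glusterd is loading a stored
peer entry that lacks a hostname. For what it is worth, the on-disk state read
at startup lives under the working directory shown in the log and can be
inspected like this:

cat /var/lib/glusterd/glusterd.info
ls -l /var/lib/glusterd/peers/
cat /var/lib/glusterd/peers/*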


The firewall is open (all 3 nodes have an ACCEPT rule for any port and
protocol in iptables).
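
For completeness, this is the kind of listing used to verify the rule on each
node:

iptables -L INPUT -n -v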


Version-Release number of selected component (if applicable):

glusterfs-4.1.6-1.el7.x86_64
glusterfs-fuse-4.1.6-1.el7.x86_64
glusterfs-libs-4.1.6-1.el7.x86_64
glusterfs-client-xlators-4.1.6-1.el7.x86_64

Also tried with the latest 4.1.x release (4.1.7); the issue is the same.

Note the apparent version mismatch: the crash header above reports
package-string glusterfs 7.1, while the packages listed here are 4.1.6, and
the handshake log negotiates op-version 40100 (a 4.1.x level).
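
To pin down what each node is actually running, the standard checks are:

rpm -qa | grep -i glusterfs
gluster --version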



How reproducible:

It happens every time on the new node that I am trying to add to the cluster.
It did not happen on an identical previous node!


Steps to Reproduce:
1. Set up a new node with a fresh /var/lib/glusterd (the log above shows
   glusterd.info missing and a new UUID being generated).
2. From one of the existing cluster nodes, run: gluster peer probe <new node>
3. glusterd on the new node crashes with signal 11; restarting it then fails
   as shown above.

Actual results:

glusterd on the new node crashes with signal 11 during the peer probe and
afterwards refuses to start ("Assertion failed: hostname").

Expected results:

The new node joins the cluster and glusterd on all nodes keeps running.

Additional info:
