[Bugs] [Bug 1449495] New: glfsheal: crashed(segfault) with disperse volume in RDMA
bugzilla at redhat.com
Wed May 10 07:13:18 UTC 2017
https://bugzilla.redhat.com/show_bug.cgi?id=1449495
Bug ID: 1449495
Summary: glfsheal: crashed (segfault) with disperse volume in RDMA
Product: GlusterFS
Version: 3.10
Component: rdma
Severity: high
Assignee: bugs at gluster.org
Reporter: potatogim at gluesys.com
CC: bugs at gluster.org
Created attachment 1277525
--> https://bugzilla.redhat.com/attachment.cgi?id=1277525&action=edit
rdma.patch
Description of problem:
In 3.10.1, glfsheal with a disperse volume always crashes (segfaults) in an RDMA environment.
Version-Release number of selected component (if applicable): v3.10.1
How reproducible:
Steps to Reproduce:
1. install Mellanox OFED packages (librdmacm, libibverbs, etc.)
2. gluster volume create <vol> disperse 3 server{1..4}:<vol> transport rdma
3. run glfsheal
Actual results:
[root at server-1 ~]# glfsheal IBTEST
Segmentation fault (core dumped)
Expected results:
[root at server-1 ~]# glfsheal IBTEST
Brick 10.10.1.220:/volume/IBTEST
<gfid:d338c46e-bff6-4da0-b962-590ef3a19102> - Is in split-brain
...
Additional info:
- core dump with gdb
Core was generated by `/usr/sbin/glfsheal IBTEST'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007efc1467b56c in __gf_rdma_teardown (this=0x7efc08032740) at
rdma.c:3255
3255 if (peer->cm_id->qp != NULL) {
(gdb) bt
#0 0x00007efc1467b56c in __gf_rdma_teardown (this=0x7efc08032740) at
rdma.c:3255
#1 0x00007efc1467b6b0 in gf_rdma_teardown (this=0x7efc08032740, port=<value
optimized out>) at rdma.c:3287
#2 gf_rdma_connect (this=0x7efc08032740, port=<value optimized out>) at
rdma.c:4769
#3 0x00007efc23df24e9 in rpc_clnt_reconnect (conn_ptr=0x7efc080325d0) at
rpc-clnt.c:422
#4 0x00007efc23df25d6 in rpc_clnt_start (rpc=0x7efc080325a0) at
rpc-clnt.c:1210
#5 0x00007efc16b55c53 in notify (this=0x7efc0801cec0, event=1,
data=0x7efc0801e930) at client.c:2354
#6 0x00007efc2423f592 in xlator_notify (xl=0x7efc0801cec0, event=1,
data=0x7efc0801e930) at xlator.c:566
#7 0x00007efc242c07c7 in default_notify (this=0x7efc0801e930, event=1,
data=0x7efc08020270) at defaults.c:3090
#8 0x00007efc155d5d18 in notify (this=<value optimized out>, event=<value
optimized out>, data=<value optimized out>) at snapview-client.c:2393
#9 0x00007efc2423f592 in xlator_notify (xl=0x7efc0801e930, event=1,
data=0x7efc08020270) at xlator.c:566
#10 0x00007efc242c07c7 in default_notify (this=0x7efc08020270, event=1,
data=0x7efc08021ea0) at defaults.c:3090
#11 0x00007efc153ba69e in notify (this=0x7efc08020270, event=<value optimized
out>, data=0x7efc08021ea0) at io-stats.c:3991
#12 0x00007efc2423f592 in xlator_notify (xl=0x7efc08020270, event=1,
data=0x7efc08021ea0) at xlator.c:566
#13 0x00007efc242c07c7 in default_notify (this=0x7efc08021ea0, event=1,
data=0x7efc08021ea0) at defaults.c:3090
#14 0x00007efc2423f592 in xlator_notify (xl=0x7efc08021ea0, event=1,
data=0x7efc08021ea0) at xlator.c:566
#15 0x00007efc24279bae in glusterfs_graph_parent_up (graph=<value optimized
out>) at graph.c:442
#16 0x00007efc24279fb2 in glusterfs_graph_activate (graph=0x7efc08003990,
ctx=0x7eb1a0) at graph.c:711
#17 0x00007efc23bc9f91 in glfs_process_volfp (fs=<value optimized out>,
fp=0x7efc08003710) at glfs-mgmt.c:79
#18 0x00007efc23bca3aa in glfs_mgmt_getspec_cbk (req=<value optimized out>,
iov=<value optimized out>, count=<value optimized out>, myframe=0x7efc080012d0)
at glfs-mgmt.c:665
#19 0x00007efc23df2ad5 in rpc_clnt_handle_reply (clnt=0x876540,
pollin=0x7efc080028c0) at rpc-clnt.c:793
#20 0x00007efc23df3c85 in rpc_clnt_notify (trans=<value optimized
out>, mydata=0x876570, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7efc080028c0)
at rpc-clnt.c:986
#21 0x00007efc23deed68 in rpc_transport_notify (this=<value optimized out>,
event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:538
#22 0x00007efc16fd19bd in socket_event_poll_in (this=0x876740) at socket.c:2268
#23 0x00007efc16fd2cbe in socket_event_handler (fd=<value optimized out>,
idx=<value optimized out>, data=0x876740, poll_in=1, poll_out=0, poll_err=0) at
socket.c:2398
#24 0x00007efc242a0716 in event_dispatch_epoll_handler (data=0x7efc10000920) at
event-epoll.c:572
#25 event_dispatch_epoll_worker (data=0x7efc10000920) at event-epoll.c:675
#26 0x00007efc2351daa1 in start_thread (arg=0x7efc17be0700) at
pthread_create.c:301
#27 0x00007efc2326abcd in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:115
#0 0x00007f5f9a28956c in __gf_rdma_teardown (this=0x7f5f94032780) at
rdma.c:3255
(gdb) list
3250 gf_rdma_peer_t *peer = NULL;
3251
3252 priv = this->private;
3253 peer = &priv->peer;
3254
3255 if (peer->cm_id->qp != NULL) {
3256 __gf_rdma_destroy_qp (this);
3257 }
3258
3259 if (!list_empty (&priv->peer.ioq)) {
(gdb) print peer->cm_id
$4 = (struct rdma_cm_id *) 0x0
In my opinion, this is caused by incorrect error handling in __gf_rdma_teardown()
(https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/rdma/src/rdma.c#L3256).
If rdma_create_id() fails in gf_rdma_connect(), execution jumps to the 'unlock'
label and then calls gf_rdma_teardown(), which in turn calls __gf_rdma_teardown().
Currently, __gf_rdma_teardown() checks the InfiniBand QP by dereferencing
peer->cm_id->qp. Unfortunately, cm_id was never allocated in this situation, so
the dereference crashes the process.
I have attached a patch that resolves this issue.
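For reference, the following is only a minimal sketch of the NULL guard described
above, based on the code shown in the gdb listing; the actual change is in the
attached rdma.patch.

/* __gf_rdma_teardown() in rpc/rpc-transport/rdma/src/rdma.c -- sketch only.
 * When rdma_create_id() fails in gf_rdma_connect(), peer->cm_id is never
 * allocated, so it must be checked before its qp member is dereferenced. */
priv = this->private;
peer = &priv->peer;

if (peer->cm_id != NULL && peer->cm_id->qp != NULL) {
        __gf_rdma_destroy_qp (this);
}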
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.