[Gluster-users] glusterd SIGSEGV crash when create volume with transport=rdma

Mike Lykov combr at ya.ru
Wed Nov 7 11:01:44 UTC 2018


Hi All!

I'm trying to use the oVirt virtualization platform with GlusterFS storage and Intel Omni-Path "Infiniband" interfaces.
All packages are version 3.12 from the ovirt-4.2 repository, but I also tried gluster 4.1 from the CentOS centos-release-gluster41 repository.
Hosts are CentOS 7.5.
glusterd crashes with SIGSEGV.
Is there some special configuration needed for the rdma transport?
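(The only management-side knob I'm aware of is the transport-type option in /etc/glusterfs/glusterd.vol. I have not changed it from the packaged default, which, if I read the stock file correctly, looks roughly like this:)

volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    ...
end-volume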

Created trusted pool:
[root at ovirtnode1 log]# gluster pool list
UUID					Hostname       	State
5a9a0a5f-12f4-48b1-bfbe-24c172adc65c	ovirtstor5	Connected
41350da9-c944-41c5-afdc-46ff51ab93f6	ovirtstor6	Connected
0f50175e-7e47-4839-99c7-c7ced21f090c	localhost      	Connected
(this is from 172.16.100.1; peer ovirtstor5 is 172.16.100.5, ovirtstor6 is 172.16.100.6)
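(The pool itself was assembled with plain peer probes over the regular network, nothing RDMA-specific at that stage; commands along these lines:)

[root at ovirtnode1 ~]# gluster peer probe ovirtstor5
[root at ovirtnode1 ~]# gluster peer probe ovirtstor6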

Creating Volume (Success):

gluster volume create data_rdma replica 3 transport rdma ovirtstor1:/gluster_bricks/data_rdma/data_rdma ovirtstor5:/gluster_bricks/data_rdma/data_rdma ovirtstor6:/gluster_bricks/data_rdma/data_rdma
volume create: data_rdma: success: please start the volume to access data

glusterd.log (timestamps are UTC; the local time zone is UTC+4):
[2018-11-07 09:52:43.106185] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.12.15/xlator/mgmt/glusterd.so(+0xdf50a) [0x7f3423e4350a] -->/usr/lib64/glusterfs/3.12.15/xlator/mgmt/glusterd.so(+0xdefcd) [0x7f3423e42fcd] -->/lib64/libglus
[2018-11-07 09:52:57.825351] I [MSGID: 106488] [glusterd-handler.c:1548:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2018-11-07 09:53:19.119450] I [glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a fresh brick process for brick /gluster_bricks/data_rdma/data_rdma
[2018-11-07 09:53:19.186374] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /gluster_bricks/data_rdma/data_rdma.rdma on port 49155

Status (All Online):
[root at ovirtnode1 /]# gluster volume status data_rdma
Status of volume: data_rdma
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ovirtstor1:/gluster_bricks/data_rdma/
data_rdma                                   0         49155      Y       156176
Brick ovirtstor5:/gluster_bricks/data_rdma/
data_rdma                                   0         49155      Y       47958
Brick ovirtstor6:/gluster_bricks/data_rdma/
data_rdma                                   0         49155      Y       18911
Self-heal Daemon on localhost               N/A       N/A        Y       156206
Self-heal Daemon on ovirtstor5.miac         N/A       N/A        Y       47994
Self-heal Daemon on ovirtstor6.miac         N/A       N/A        Y       18947

After 3 minutes:
[2018-11-07 09:56:08.957536] C [MSGID: 103021] [rdma.c:3263:gf_rdma_create_qp] 0-rdma.management: rdma.management: could not create QP [Permission denied]
[2018-11-07 09:56:08.957986] W [MSGID: 103021] [rdma.c:1049:gf_rdma_cm_handle_connect_request] 0-rdma.management: could not create QP (peer:172.16.100.5:49151 me:172.16.100.1:24008)
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-11-07 09:56:08
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.15
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f342f2f54e0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f342f2ff414]
/lib64/libc.so.6(+0x362f0)[0x7f342d9552f0]
/usr/lib64/glusterfs/3.12.15/xlator/mgmt/glusterd.so(+0x160c4)[0x7f3423d7a0c4]
/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x10f)[0x7f342f0b584f]
/lib64/libgfrpc.so.0(rpcsvc_notify+0xc0)[0x7f342f0b7f20]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f342f0b9ea3]
/usr/lib64/glusterfs/3.12.15/rpc-transport/rdma.so(+0x4fef)[0x7f341fba8fef]
/usr/lib64/glusterfs/3.12.15/rpc-transport/rdma.so(+0x7c20)[0x7f341fbabc20]
/lib64/libpthread.so.0(+0x7e25)[0x7f342e154e25]
/lib64/libc.so.6(clone+0x6d)[0x7f342da1dbad]
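(If it helps, I can try to resolve the glusterd.so offset from the frame above to a function name once I install the matching glusterfs-debuginfo package, with something like:)

[root at ovirtnode1 ~]# addr2line -f -C -e /usr/lib64/glusterfs/3.12.15/xlator/mgmt/glusterd.so 0x160c4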

glusterd is listening only on tcp/24007 on all nodes, but why? Is that why the connection to 172.16.100.1:24008 fails?
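(As far as I know, 24007 is glusterd's TCP port and 24008 its RDMA port. I checked with something like the commands below; though, if I understand correctly, an rdma_cm listener is not a TCP socket and would not show up in ss/netstat output anyway:)

[root at ovirtnode1 ~]# ss -tlnp | grep glusterd
[root at ovirtnode1 ~]# grep transport-type /etc/glusterfs/glusterd.vol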


On the peer node (syslog 'messages'):
Nov  7 13:53:24 ovirtnode5 glustershd[47994]: [2018-11-07 09:53:24.570701] C [MSGID: 103021] [rdma.c:3263:gf_rdma_create_qp] 0-data_rdma-client-0: data_rdma-client-0: could not create QP [Permission denied]
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: [2018-11-07 09:56:09.988118] C [MSGID: 103021] [rdma.c:3263:gf_rdma_create_qp] 0-rdma.management: rdma.management: could not create QP [Permission denied]
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: pending frames:
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: patchset: git://git.gluster.org/glusterfs.git
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: signal received: 11
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: time of crash:
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: 2018-11-07 09:56:09
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: configuration details:
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: argp 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: backtrace 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: dlfcn 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: libpthread 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: llistxattr 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: setfsid 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: spinlock 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: epoll.h 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: xattr.h 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: st_atim.tv_nsec 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: package-string: glusterfs 3.12.15
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: ---------
Nov  7 13:56:10 ovirtnode5 abrt-hook-ccpp: Process 15657 (glusterfsd) of user 0 killed by SIGSEGV - dumping core

ABRT shows this:
[root at ovirtnode1 glusterfs]# abrt-cli list
id 7b7b53a92fe3f26271fd9f9012d1d0d011d94773
reason:         glusterfsd killed by SIGSEGV
time:           Wed 07 Nov 2018 13:56:09
cmdline:        /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
package:        glusterfs-fuse-3.12.15-1.el7
uid:            0 (root)
count:          1
Directory:      /var/tmp/abrt/ccpp-2018-11-07-13:56:09-3145
Reported:       https://retrace.fedoraproject.org/faf/reports/bthash/badd77dc4fa0d04f686a4b3366e262d1140fdb55
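(The abrt dump directory should also contain the core itself, so I can open it in gdb and post a full backtrace if that is more useful; assuming the usual 'coredump' file name and installed debuginfo, something like:)

[root at ovirtnode1 ~]# gdb /usr/sbin/glusterd /var/tmp/abrt/ccpp-2018-11-07-13:56:09-3145/coredump
(gdb) thread apply all bt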

Code (I don't know which version/release this corresponds to; found on GitHub):
https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/rdma/src/rdma.c

    /* in gf_rdma_create_qp(); init_attr is filled in just above this call,
       including .srq = device->srq */
    ret = rdma_create_qp(peer->cm_id, device->pd, &init_attr);
    if (ret != 0) {
        gf_msg(peer->trans->name, GF_LOG_CRITICAL, errno,
               RDMA_MSG_CREAT_QP_FAILED, "%s: could not create QP",
               this->name);
        ret = -1;
    }
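To see whether QP creation also fails with "Permission denied" outside of gluster, I can run a minimal verbs-only test of my own (this is just my sketch, not gluster code; it skips SRQ and rdma_cm, which gluster's transport does use). Assuming libibverbs-devel is installed, it builds with: gcc qp_test.c -o qp_test -libverbs

/* qp_test.c - standalone check: open the first verbs device, allocate a PD
 * and a CQ, and try to create an RC QP. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "ibv_open_device: %s\n", strerror(errno));
        return 1;
    }

    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);
    if (!pd || !cq) {
        fprintf(stderr, "PD/CQ allocation failed: %s\n", strerror(errno));
        return 1;
    }

    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.cap.max_send_wr = 32;  /* modest caps, roughly in the spirit of gf_rdma_create_qp() */
    attr.cap.max_recv_wr = 32;
    attr.cap.max_send_sge = 2;
    attr.cap.max_recv_sge = 1;
    attr.qp_type = IBV_QPT_RC;
    attr.sq_sig_all = 1;

    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) {
        fprintf(stderr, "ibv_create_qp failed: %s\n", strerror(errno));
        return 1;
    }
    printf("ibv_create_qp succeeded\n");

    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}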


RDMA on its own seems to be working:
[root at ovirtnode5 log]# ib_write_bw -D 30 --cpu_util ovirtstor1
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : hfi1_0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x04 QPN 0x00ae PSN 0x15933a RKey 0x60181900 VAddr 0x007fde76ee6000
 remote address: LID 0x03 QPN 0x0056 PSN 0x7758e RKey 0x40101100 VAddr 0x007fde37488000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]    CPU_Util[%]
Conflicting CPU frequency values detected: 3692.431000 != 3109.112000. CPU Frequency is not max.
 65536      1646300          0.00               6430.36		   0.102886	    1.40
---------------------------------------------------------------------------------------
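One thing I notice in that output is "rdma_cm QPs : OFF", while gluster's rdma transport goes through librdmacm. If it is useful, I can rerun the test with rdma_cm QPs enabled (the -R option, if I read the perftest help correctly):

[root at ovirtnode5 log]# ib_write_bw -D 30 -R --cpu_util ovirtstor1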

Info about the hardware & driver:
[root at ovirtnode1 glusterfs]# hfi1_control -i
Driver Version: 10.8-0
Driver SrcVersion: AFDD1BF17512A67B217EB47
Opa Version: 10.8.0.0.204
0: BoardId: Intel Corporation Omni-Path HFI Silicon 100 Series [integrated]
0: Version: ChipABI 3.0, ChipRev 7.17, SW Compat 3
0: ChipSerial: 0x011aeeea
0,1: Status: 5: LinkUp 4: ACTIVE
0,1: LID=0x3 GUID=0011:7509:011a:eeea

[root at ovirtnode1 glusterfs]# opainfo
hfi1_0:1                           PortGID:0xfe80000000000000:00117509011aeeea
   PortState:     Active
   LinkSpeed      Act: 25Gb         En: 25Gb        
   LinkWidth      Act: 4            En: 4           
   LinkWidthDnGrd ActTx: 4  Rx: 4   En: 3,4         
   LCRC           Act: 14-bit       En: 14-bit,16-bit,48-bit       Mgmt: True 
   LID: 0x00000003-0x00000003       SM LID: 0x00000003 SL: 0 
   Xmit Data:               6752 MB Pkts:              9628972
   Recv Data:             217461 MB Pkts:             60540469
   Link Quality: 5 (Excellent)

