[Gluster-users] glusterd SIGSEGV crash when create volume with transport=rdma
Mike Lykov
combr at ya.ru
Wed Nov 7 11:01:44 UTC 2018
Hi All!
I'm trying to use the oVirt virtualization platform with GlusterFS storage and Intel Omni-Path "Infiniband" interfaces.
All packages are version 3.12 from the ovirt-4.2 repository, but I have also tried Gluster 4.1 from the CentOS centos-release-gluster41 repository.
Hosts are CentOS 7.5.
glusterd crashes with SIGSEGV.
Is any special configuration needed for the rdma transport?
Created trusted pool:
[root at ovirtnode1 log]# gluster pool list
UUID Hostname State
5a9a0a5f-12f4-48b1-bfbe-24c172adc65c ovirtstor5 Connected
41350da9-c944-41c5-afdc-46ff51ab93f6 ovirtstor6 Connected
0f50175e-7e47-4839-99c7-c7ced21f090c localhost Connected
(this is from 172.16.100.1; the ovirtstor5 peer is 172.16.100.5, ovirtstor6 is 172.16.100.6)
Creating Volume (Success):
gluster volume create data_rdma replica 3 transport rdma ovirtstor1:/gluster_bricks/data_rdma/data_rdma ovirtstor5:/gluster_bricks/data_rdma/data_rdma ovirtstor6:/gluster_bricks/data_rdma/data_rdma
volume create: data_rdma: success: please start the volume to access data
glusterd.log (UTC time; the local time zone is UTC+4):
[2018-11-07 09:52:43.106185] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.12.15/xlator/mgmt/glusterd.so(+0xdf50a) [0x7f3423e4350a] -->/usr/lib64/glusterfs/3.12.15/xlator/mgmt/glusterd.so(+0xdefcd) [0x7f3423e42fcd] -->/lib64/libglus
[2018-11-07 09:52:57.825351] I [MSGID: 106488] [glusterd-handler.c:1548:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2018-11-07 09:53:19.119450] I [glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a fresh brick process for brick /gluster_bricks/data_rdma/data_rdma
[2018-11-07 09:53:19.186374] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /gluster_bricks/data_rdma/data_rdma.rdma on port 49155
Status (All Online):
[root at ovirtnode1 /]# gluster volume status data_rdma
Status of volume: data_rdma
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick ovirtstor1:/gluster_bricks/data_rdma/
data_rdma 0 49155 Y 156176
Brick ovirtstor5:/gluster_bricks/data_rdma/
data_rdma 0 49155 Y 47958
Brick ovirtstor6:/gluster_bricks/data_rdma/
data_rdma 0 49155 Y 18911
Self-heal Daemon on localhost N/A N/A Y 156206
Self-heal Daemon on ovirtstor5.miac N/A N/A Y 47994
Self-heal Daemon on ovirtstor6.miac N/A N/A Y 18947
After 3 minutes:
[2018-11-07 09:56:08.957536] C [MSGID: 103021] [rdma.c:3263:gf_rdma_create_qp] 0-rdma.management: rdma.management: could not create QP [Permission denied]
[2018-11-07 09:56:08.957986] W [MSGID: 103021] [rdma.c:1049:gf_rdma_cm_handle_connect_request] 0-rdma.management: could not create QP (peer:172.16.100.5:49151 me:172.16.100.1:24008)
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-11-07 09:56:08
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.15
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f342f2f54e0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f342f2ff414]
/lib64/libc.so.6(+0x362f0)[0x7f342d9552f0]
/usr/lib64/glusterfs/3.12.15/xlator/mgmt/glusterd.so(+0x160c4)[0x7f3423d7a0c4]
/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x10f)[0x7f342f0b584f]
/lib64/libgfrpc.so.0(rpcsvc_notify+0xc0)[0x7f342f0b7f20]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f342f0b9ea3]
/usr/lib64/glusterfs/3.12.15/rpc-transport/rdma.so(+0x4fef)[0x7f341fba8fef]
/usr/lib64/glusterfs/3.12.15/rpc-transport/rdma.so(+0x7c20)[0x7f341fbabc20]
/lib64/libpthread.so.0(+0x7e25)[0x7f342e154e25]
/lib64/libc.so.6(clone+0x6d)[0x7f342da1dbad]
glusterd is listening only on tcp/24007 on all nodes, but why? Is that why the connection to 172.16.100.1:24008 fails?
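For what it's worth, glusterd only opens its RDMA management listener on tcp/24008 when its own transport includes rdma in /etc/glusterfs/glusterd.vol. A sketch of what that stanza looks like when RDMA is enabled (option names from memory of the stock 3.12 packages; please check against the file actually shipped on your hosts):

    volume management
        type mgmt/glusterd
        option working-directory /var/lib/glusterd
        # "socket,rdma" makes glusterd listen on tcp/24007 and rdma/24008;
        # with plain "socket" only tcp/24007 is opened
        option transport-type socket,rdma
    end-volume

If your glusterd.vol only lists "socket", that would explain why nothing is listening on 24008.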
On the peer node (syslog 'messages'):
Nov 7 13:53:24 ovirtnode5 glustershd[47994]: [2018-11-07 09:53:24.570701] C [MSGID: 103021] [rdma.c:3263:gf_rdma_create_qp] 0-data_rdma-client-0: data_rdma-client-0: could not create QP [Permission denied]
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: [2018-11-07 09:56:09.988118] C [MSGID: 103021] [rdma.c:3263:gf_rdma_create_qp] 0-rdma.management: rdma.management: could not create QP [Permission denied]
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: pending frames:
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: patchset: git://git.gluster.org/glusterfs.git
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: signal received: 11
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: time of crash:
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: 2018-11-07 09:56:09
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: configuration details:
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: argp 1
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: backtrace 1
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: dlfcn 1
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: libpthread 1
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: llistxattr 1
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: setfsid 1
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: spinlock 1
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: epoll.h 1
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: xattr.h 1
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: st_atim.tv_nsec 1
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: package-string: glusterfs 3.12.15
Nov 7 13:56:09 ovirtnode5 glusterd[15657]: ---------
Nov 7 13:56:10 ovirtnode5 abrt-hook-ccpp: Process 15657 (glusterfsd) of user 0 killed by SIGSEGV - dumping core
ABRT shows this:
[root at ovirtnode1 glusterfs]# abrt-cli list
id 7b7b53a92fe3f26271fd9f9012d1d0d011d94773
reason: glusterfsd killed by SIGSEGV
time: Wed 07 Nov 2018 13:56:09
cmdline: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
package: glusterfs-fuse-3.12.15-1.el7
uid: 0 (root)
count: 1
Directory: /var/tmp/abrt/ccpp-2018-11-07-13:56:09-3145
Reported: https://retrace.fedoraproject.org/faf/reports/bthash/badd77dc4fa0d04f686a4b3366e262d1140fdb55
Code (I don't know which version/release this is; found on GitHub):
https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/rdma/src/rdma.c
    /* in gf_rdma_create_qp(); the ".srq = device->srq," line belongs to the
       struct ibv_qp_init_attr initializer just above this call */
    ret = rdma_create_qp(peer->cm_id, device->pd, &init_attr);
    if (ret != 0) {
        gf_msg(peer->trans->name, GF_LOG_CRITICAL, errno,
               RDMA_MSG_CREAT_QP_FAILED, "%s: could not create QP",
               this->name);
        ret = -1;
        ...
    }
RDMA on its own seems to be working:
[root at ovirtnode5 log]# ib_write_bw -D 30 --cpu_util ovirtstor1
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : hfi1_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x04 QPN 0x00ae PSN 0x15933a RKey 0x60181900 VAddr 0x007fde76ee6000
remote address: LID 0x03 QPN 0x0056 PSN 0x7758e RKey 0x40101100 VAddr 0x007fde37488000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] CPU_Util[%]
Conflicting CPU frequency values detected: 3692.431000 != 3109.112000. CPU Frequency is not max.
65536 1646300 0.00 6430.36 0.102886 1.40
---------------------------------------------------------------------------------------
Info about hardware & driver:
[root at ovirtnode1 glusterfs]# hfi1_control -i
Driver Version: 10.8-0
Driver SrcVersion: AFDD1BF17512A67B217EB47
Opa Version: 10.8.0.0.204
0: BoardId: Intel Corporation Omni-Path HFI Silicon 100 Series [integrated]
0: Version: ChipABI 3.0, ChipRev 7.17, SW Compat 3
0: ChipSerial: 0x011aeeea
0,1: Status: 5: LinkUp 4: ACTIVE
0,1: LID=0x3 GUID=0011:7509:011a:eeea
[root at ovirtnode1 glusterfs]# opainfo
hfi1_0:1 PortGID:0xfe80000000000000:00117509011aeeea
PortState: Active
LinkSpeed Act: 25Gb En: 25Gb
LinkWidth Act: 4 En: 4
LinkWidthDnGrd ActTx: 4 Rx: 4 En: 3,4
LCRC Act: 14-bit En: 14-bit,16-bit,48-bit Mgmt: True
LID: 0x00000003-0x00000003 SM LID: 0x00000003 SL: 0
Xmit Data: 6752 MB Pkts: 9628972
Recv Data: 217461 MB Pkts: 60540469
Link Quality: 5 (Excellent)