[Gluster-users] memory allocation failure messages as false positives?

LaGarde, Owen M ERDC-RDE-ITL-MS Contractor Owen.M.LaGarde at erdc.dren.mil
Sat May 9 00:10:55 UTC 2015


Are there any typical reasons for glusterfsd falsely reporting a memory allocation failure when attempting to create a new IB QP?  I'm getting a high rate of similar cases but can't push any hardware or non-gluster software error into the open.  Recovering the volume after a crash is not a problem; whatever self-heal doesn't handle automatically, a rebalance takes care of just fine.
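
For reference, the call that fails underneath the rdma transport should be the QP creation in libibverbs/librdmacm.  The snippet below is my own minimal probe, not gluster code (the queue sizes and the choice of the first device are arbitrary); it just shows how that call surfaces "Cannot allocate memory".  As I understand it, ENOMEM from ibv_create_qp() frequently reflects verbs resource limits such as RLIMIT_MEMLOCK or per-device QP/CQ caps rather than the host actually being out of RAM, which is why I'm treating these as false positives.  Build with 'cc qp_probe.c -libverbs'.

---------
/* qp_probe.c: standalone sketch (not gluster code).  Creates one RC QP on
 * the first verbs device and reports errno the same way the gluster log
 * line does. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no verbs devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    struct ibv_cq *cq = ctx ? ibv_create_cq(ctx, 64, NULL, NULL, 0) : NULL;
    if (!ctx || !pd || !cq) {
        fprintf(stderr, "verbs setup failed (%s)\n", strerror(errno));
        return 1;
    }

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };

    int rc = 0;
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) {
        /* errno is normally set by the verbs library on failure;
         * strerror(ENOMEM) is the "Cannot allocate memory" text
         * seen in the gluster log line. */
        fprintf(stderr, "could not create QP (%s)\n", strerror(errno));
        rc = 2;
    } else {
        printf("QP %u created fine\n", qp->qp_num);
        ibv_destroy_qp(qp);
    }

    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return rc;
}
---------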

Below is the glusterfs-glusterd log snippet from a typical crash.  It happens with no particular pattern on any gluster server except the first in the series (which is also the one the clients name in their mounts and therefore go to for the vol info file).  The crash may occur during a 'hello world' of 1 process per node across the cluster yet not during the final and most aggressive rank of an OpenMPI All-to-All benchmark, or vice versa; there's no particular correlation with MPI traffic load, IB/RDMA traffic pattern, client population and/or activity, etc.

In all failure cases, all IPoIB, Ethernet, RDMA, and IBCV tests completed without issue and returned the appropriate bandwidth/latency/pathing.  All servers are running auditd and gmond, which show no indication of memory issues or any other failure.  All servers have run Pandora repeatedly without triggering any hardware failures.  There are no complaints from the global OpenSM instances for either IB fabric at the management points, or from the PTP SMD GUID-locked instances running on the gluster servers and talking to the backing storage controllers.
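
In case it's relevant, I can also dump the verbs-side ceilings that, as far as I know, can turn into ENOMEM on QP creation even with plenty of free RAM.  Again this is just my own sketch (qp_limits.c is my name for it, build with 'cc qp_limits.c -libverbs'); the RLIMIT_MEMLOCK it prints is for whatever environment runs the probe, and the limit that actually matters is the one glusterd/glusterfsd inherits.

---------
/* qp_limits.c: sketch of a probe that prints the locked-memory rlimit and
 * the per-device verbs resource caps (max QPs, CQs, MRs, PDs). */
#include <stdio.h>
#include <sys/resource.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
        printf("RLIMIT_MEMLOCK: cur=%llu max=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs) {
        fprintf(stderr, "no verbs devices found\n");
        return 1;
    }

    int i;
    for (i = 0; i < n; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_device_attr da;
        if (ctx && ibv_query_device(ctx, &da) == 0)
            printf("%s: max_qp=%d max_cq=%d max_mr=%d max_pd=%d\n",
                   ibv_get_device_name(devs[i]),
                   da.max_qp, da.max_cq, da.max_mr, da.max_pd);
        if (ctx)
            ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
---------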

Any ideas?

---------
[2015-05-08 23:19:26.660870] C [rdma.c:2951:gf_rdma_create_qp] 0-rdma.management: rdma.management: could not create QP (Cannot allocate memory)
[2015-05-08 23:19:26.660966] W [rdma.c:818:gf_rdma_cm_handle_connect_request] 0-rdma.management: could not create QP (peer:10.149.0.63:1013 me:10.149.1.142:24008)
pending frames:
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2015-05-08 23:19:26
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.2
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x39e3a20136]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x39e3a3abbf]
/lib64/libc.so.6[0x39e1a326a0]
/usr/lib64/glusterfs/3.6.2/xlator/mgmt/glusterd.so(glusterd_rpcsvc_notify+0x69)[0x7fefd149ec59]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x105)[0x39e32081d5]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x1a0)[0x39e3209cd0]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x39e320b6d8]
/usr/lib64/glusterfs/3.6.2/rpc-transport/rdma.so(+0x5941)[0x7fefd0251941]
/usr/lib64/glusterfs/3.6.2/rpc-transport/rdma.so(+0xb6d9)[0x7fefd02576d9]
/lib64/libpthread.so.0[0x39e26079d1]
/lib64/libc.so.6(clone+0x6d)[0x39e1ae88fd]
---------

##
## Soft bits are:
##

RHEL 6.6
kernel 2.6.32-528.el6.bz1159925.x86_64
    (this is the 6.7 pre-release kernel with the latest ib_sm updates
     for the occasional mgroup bcast/join issues, see RH BZ)
glibc-2.12-1.149.el6_6.7.x86_64
compat-opensm-libs-3.3.5-3.el6.x86_64
opensm-3.3.17-1.el6.x86_64
opensm-libs-3.3.17-1.el6.x86_64
opensm-multifabric-0.1-sgi710r3.rhel6.x86_64
    (these are vendor stubs that run IB subnet_id- and GUID-specific opensm
     master/standby instances integrated with cluster management)
glusterfs-3.6.2-1.el6.x86_64
glusterfs-debuginfo-3.6.2-1.el6.x86_64
glusterfs-devel-3.6.2-1.el6.x86_64
glusterfs-libs-3.6.2-1.el6.x86_64
glusterfs-extra-xlators-3.6.2-1.el6.x86_64
glusterfs-api-devel-3.6.2-1.el6.x86_64
glusterfs-fuse-3.6.2-1.el6.x86_64
glusterfs-server-3.6.2-1.el6.x86_64
glusterfs-cli-3.6.2-1.el6.x86_64
glusterfs-api-3.6.2-1.el6.x86_64
glusterfs-rdma-3.6.2-1.el6.x86_64

Volume in question:

[root@phoenix-smc ~]# ssh service4 gluster vol info home
Warning: No xauth data; using fake authentication data for X11 forwarding.

Volume Name: home
Type: Distribute
Volume ID: f03fcaf0-3889-45ac-a06a-a4d60d5a673d
Status: Started
Number of Bricks: 28
Transport-type: rdma
Bricks:
Brick1: service4-ib1:/mnt/l1_s4_ost0000_0000/brick
Brick2: service4-ib1:/mnt/l1_s4_ost0001_0001/brick
Brick3: service4-ib1:/mnt/l1_s4_ost0002_0002/brick
Brick4: service5-ib1:/mnt/l1_s5_ost0003_0003/brick
Brick5: service5-ib1:/mnt/l1_s5_ost0004_0004/brick
Brick6: service5-ib1:/mnt/l1_s5_ost0005_0005/brick
Brick7: service5-ib1:/mnt/l1_s5_ost0006_0006/brick
Brick8: service6-ib1:/mnt/l1_s6_ost0007_0007/brick
Brick9: service6-ib1:/mnt/l1_s6_ost0008_0008/brick
Brick10: service6-ib1:/mnt/l1_s6_ost0009_0009/brick
Brick11: service7-ib1:/mnt/l1_s7_ost000a_0010/brick
Brick12: service7-ib1:/mnt/l1_s7_ost000b_0011/brick
Brick13: service7-ib1:/mnt/l1_s7_ost000c_0012/brick
Brick14: service7-ib1:/mnt/l1_s7_ost000d_0013/brick
Brick15: service10-ib1:/mnt/l1_s10_ost000e_0014/brick
Brick16: service10-ib1:/mnt/l1_s10_ost000f_0015/brick
Brick17: service10-ib1:/mnt/l1_s10_ost0010_0016/brick
Brick18: service11-ib1:/mnt/l1_s11_ost0011_0017/brick
Brick19: service11-ib1:/mnt/l1_s11_ost0012_0018/brick
Brick20: service11-ib1:/mnt/l1_s11_ost0013_0019/brick
Brick21: service11-ib1:/mnt/l1_s11_ost0014_0020/brick
Brick22: service12-ib1:/mnt/l1_s12_ost0015_0021/brick
Brick23: service12-ib1:/mnt/l1_s12_ost0016_0022/brick
Brick24: service12-ib1:/mnt/l1_s12_ost0017_0023/brick
Brick25: service13-ib1:/mnt/l1_s13_ost0018_0024/brick
Brick26: service13-ib1:/mnt/l1_s13_ost0019_0025/brick
Brick27: service13-ib1:/mnt/l1_s13_ost001a_0026/brick
Brick28: service13-ib1:/mnt/l1_s13_ost001b_0027/brick
Options Reconfigured:
performance.stat-prefetch: off
[root@phoenix-smc ~]# ssh service4 gluster vol status home
Warning: No xauth data; using fake authentication data for X11 forwarding.
Status of volume: home
Gluster process                        Port    Online    Pid
------------------------------------------------------------------------------
Brick service4-ib1:/mnt/l1_s4_ost0000_0000/brick    49156    Y    8028
Brick service4-ib1:/mnt/l1_s4_ost0001_0001/brick    49157    Y    8040
Brick service4-ib1:/mnt/l1_s4_ost0002_0002/brick    49158    Y    8052
Brick service5-ib1:/mnt/l1_s5_ost0003_0003/brick    49163    Y    6526
Brick service5-ib1:/mnt/l1_s5_ost0004_0004/brick    49164    Y    6533
Brick service5-ib1:/mnt/l1_s5_ost0005_0005/brick    49165    Y    6540
Brick service5-ib1:/mnt/l1_s5_ost0006_0006/brick    49166    Y    6547
Brick service6-ib1:/mnt/l1_s6_ost0007_0007/brick    49155    Y    8027
Brick service6-ib1:/mnt/l1_s6_ost0008_0008/brick    49156    Y    8039
Brick service6-ib1:/mnt/l1_s6_ost0009_0009/brick    49157    Y    8051
Brick service7-ib1:/mnt/l1_s7_ost000a_0010/brick    49160    Y    9067
Brick service7-ib1:/mnt/l1_s7_ost000b_0011/brick    49161    Y    9074
Brick service7-ib1:/mnt/l1_s7_ost000c_0012/brick    49162    Y    9081
Brick service7-ib1:/mnt/l1_s7_ost000d_0013/brick    49163    Y    9088
Brick service10-ib1:/mnt/l1_s10_ost000e_0014/brick    49155    Y    8108
Brick service10-ib1:/mnt/l1_s10_ost000f_0015/brick    49156    Y    8120
Brick service10-ib1:/mnt/l1_s10_ost0010_0016/brick    49157    Y    8132
Brick service11-ib1:/mnt/l1_s11_ost0011_0017/brick    49160    Y    8070
Brick service11-ib1:/mnt/l1_s11_ost0012_0018/brick    49161    Y    8082
Brick service11-ib1:/mnt/l1_s11_ost0013_0019/brick    49162    Y    8094
Brick service11-ib1:/mnt/l1_s11_ost0014_0020/brick    49163    Y    8106
Brick service12-ib1:/mnt/l1_s12_ost0015_0021/brick    49155    Y    8072
Brick service12-ib1:/mnt/l1_s12_ost0016_0022/brick    49156    Y    8084
Brick service12-ib1:/mnt/l1_s12_ost0017_0023/brick    49157    Y    8096
Brick service13-ib1:/mnt/l1_s13_ost0018_0024/brick    49156    Y    8156
Brick service13-ib1:/mnt/l1_s13_ost0019_0025/brick    49157    Y    8168
Brick service13-ib1:/mnt/l1_s13_ost001a_0026/brick    49158    Y    8180
Brick service13-ib1:/mnt/l1_s13_ost001b_0027/brick    49159    Y    8192
NFS Server on localhost                    2049    Y    8065
NFS Server on service6-ib1                2049    Y    8064
NFS Server on service13-ib1                2049    Y    8205
NFS Server on service11-ib1                2049    Y    11833
NFS Server on service12-ib1                2049    Y    8109
NFS Server on service10-ib1                2049    Y    8145
NFS Server on service5-ib1                2049    Y    6554
NFS Server on service7-ib1                2049    Y    15140

Task Status of Volume home
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 88f1e627-c7cc-40fc-b4a8-7672a6151712
Status               : completed

[root@phoenix-smc ~]#
