[Gluster-devel] Does RDMA support flow control in GlusterFS now

Paul flypen at gmail.com
Sat Dec 23 09:44:27 UTC 2017


We use Mellanox InfiniBand cards to build an IB cluster with several
storage nodes and more than 20 clients. The GlusterFS version is 3.11.
The storage OS is CentOS 6.5, and the client OS is CentOS 7.3.
Previously we used IP over IB and everything was fine. After switching
to RDMA we get higher bandwidth, but we often see brick-disconnect
messages in the client logs, while the brick logs show nothing abnormal
at the same time. Although all bricks eventually reconnect, the
disconnects cause serious problems; for example, a simple "ls" or "df"
can take several minutes to complete.

Here is an example of a brick-disconnect log from one client:
[2017-12-21 10:45:47.476597] C
[rpc-clnt-ping.c:186:rpc_clnt_ping_timer_expired] 0-data-client-129: server
10.0.0.35:49204 has not responded in the last 60 seconds,
disconnecting.(trans1:0,trans2:0)
[2017-12-21 10:45:47.478820] I [MSGID: 114018]
[client.c:2285:client_rpc_notify] 0-data-client-129: disconnected from
data-client-129. Client process will keep trying to connect to glusterd
until brick's port is available
[2017-12-21 10:45:47.479267] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f565546230b] (-->
/lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f56552279fe] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f5655227b0e] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f5655229280] (-->
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f5655229d30] )))))
0-data-client-129: forced unwinding frame type(GlusterFS 3.3)
op(ENTRYLK(31)) called at 2017-12-23 10:43:52.887616 (xid=0x9da3f5)
[2017-12-21 10:45:47.479317] E [MSGID: 114031]
[client-rpc-fops.c:1646:client3_3_entrylk_cbk] 0-data-client-129: remote
operation failed [Transport endpoint is not connected]
[2017-12-21 10:45:47.479352] E [MSGID: 108007]
[afr-lk-common.c:825:afr_unlock_entrylk_cbk] 0-data-replicate-64:
/data/a3581.data: unlock failed on data-client-129 [Transport endpoint is
not connected]
[2017-12-21 10:45:47.479718] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f565546230b] (-->
/lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f56552279fe] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f5655227b0e] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f5655229280] (-->
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f5655229d30] )))))
0-data-client-129: forced unwinding frame type(GlusterFS 3.3)
op(LOOKUP(27)) called at 2017-12-23 10:43:53.249305 (xid=0x9da3f6)
[2017-12-21 10:45:47.479771] W [MSGID: 114031]
[client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data-client-129: remote
operation failed. Path: /data/b07869.data
(fe89d36e-16b8-4b06-bd36-69023217db9f) [Transport endpoint is not connected]
[2017-12-21 10:45:47.480644] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f565546230b] (-->
/lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f56552279fe] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f5655227b0e] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f5655229280] (-->
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f5655229d30] )))))
0-data-client-129: forced unwinding frame type(GF-DUMP) op(NULL(2)) called
at 2017-12-23 10:44:47.468222 (xid=0x9da3f7)
[2017-12-21 10:45:47.480682] W [rpc-clnt-ping.c:243:rpc_clnt_ping_cbk]
0-data-client-129: socket disconnected
[2017-12-21 10:45:47.481046] W [MSGID: 114031]
[client-rpc-fops.c:2928:client3_3_lookup_cbk] 0-data-client-129: remote
operation failed. Path: (null) (00000000-0000-0000-0000-000000000000)
[Transport endpoint is not connected]
[2017-12-21 10:45:58.497609] I [rpc-clnt.c:2000:rpc_clnt_reconfig]
0-data-client-129: changing port to 49204 (from 0)
[2017-12-21 10:45:58.512289] I [MSGID: 114057]
[client-handshake.c:1451:select_server_supported_programs]
0-data-client-129: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-12-21 10:45:58.517383] I [MSGID: 114046]
[client-handshake.c:1216:client_setvolume_cbk] 0-data-client-129: Connected
to data-client-129, attached to remote volume '/disks/xnuyUF3N/brick'..


We find that the rq_num_rnr hardware counter of the IB card on some
clients is very large:
# cat /sys/class/infiniband/mlx4_0/ports/1/hw_counters/rq_num_rnr
943004905

The corresponding value on the storage node is also large:
# cat /sys/class/infiniband/mlx4_0/ports/1/hw_counters/rq_num_rnr
23193068

With IP over IB the counter stays at 0, and on the clients where we do
not see the brick-disconnect problem, rq_num_rnr is also zero.

We suspect an RDMA flow-control problem: one side sends data faster
than the other side can post receive buffers, so the responder answers
with RNR (Receiver Not Ready) NAKs and rq_num_rnr increases.
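
For context, here is a minimal verbs-level sketch (illustrative only,
not GlusterFS code) of the two knobs that govern RNR behaviour on a
reliable-connected QP; all attributes not related to RNR handling are
omitted:

#include <string.h>
#include <infiniband/verbs.h>

/* Illustrative sketch: RNR-related attributes on an RC QP.  The other
 * attributes required for the INIT->RTR->RTS transitions (path, PSNs,
 * MTU, ...) are deliberately omitted. */
static int set_rnr_attrs(struct ibv_qp *qp)
{
        struct ibv_qp_attr attr;

        /* Responder side (set at RTR): how long the peer must back
         * off after receiving an RNR NAK.  Encoding 12 is ~0.64 ms
         * in the IB spec's RNR-timer table. */
        memset(&attr, 0, sizeof(attr));
        attr.qp_state      = IBV_QPS_RTR;
        attr.min_rnr_timer = 12;
        if (ibv_modify_qp(qp, &attr,
                          IBV_QP_STATE | IBV_QP_MIN_RNR_TIMER /* | ... */))
                return -1;

        /* Requester side (set at RTS): how many times to retry after
         * an RNR NAK.  7 is special-cased by the spec to mean "retry
         * forever". */
        memset(&attr, 0, sizeof(attr));
        attr.qp_state  = IBV_QPS_RTS;
        attr.timeout   = 14; /* local ACK timeout: 4.096us * 2^14 ~ 67 ms */
        attr.retry_cnt = 7;
        attr.rnr_retry = 7;
        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_TIMEOUT |
                             IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY /* | ... */);
}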

Does RDMA support flow control in GlusterFS now?

And can we adjust these macros, defined in rdma.h, to avoid this problem?

/* Additional attributes */
#define GF_RDMA_TIMEOUT                14
#define GF_RDMA_RETRY_CNT              7
#define GF_RDMA_RNR_RETRY              7
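
For reference: in the InfiniBand spec an RNR retry count of 7 already
means "retry indefinitely", so raising GF_RDMA_RNR_RETRY above 7 has
no effect, and a timeout encoding of 14 corresponds to a local ACK
timeout of 4.096 us * 2^14, roughly 67 ms. Assuming these macros are
handed to librdmacm in the usual pattern (an assumption on our side;
the field names below come from <rdma/rdma_cma.h>, not from the
GlusterFS source), they would end up in the connection parameters
roughly like this:

#include <rdma/rdma_cma.h>

/* Hypothetical sketch, not GlusterFS source: how such macros are
 * typically fed to rdma_connect() via struct rdma_conn_param. */
struct rdma_conn_param conn_param = {
        .responder_resources = 1,
        .initiator_depth     = 1,
        .retry_count         = 7, /* GF_RDMA_RETRY_CNT                */
        .rnr_retry_count     = 7, /* GF_RDMA_RNR_RETRY: retry forever */
};
/* ... rdma_connect(cm_id, &conn_param); */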

