[Gluster-users] RDMA inline threshold?

Dan Lavu dan at redhat.com
Wed May 30 01:00:26 UTC 2018


Forgot to mention, sometimes I have to force start other volumes as
well; it's hard to determine which brick process is locked up from the logs.
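
One rough way to spot them is to scan every volume for status rows whose
Online column reads N; an untested sketch, assuming bash and the gluster CLI
on the PATH:

  for vol in $(gluster volume list); do
      echo "== $vol =="
      # print brick / self-heal daemon rows that are not online
      gluster volume status "$vol" | awk 'NF > 1 && $(NF-1) == "N"'
  done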


Status of volume: rhev_vms_primary
Gluster process                                                    TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick spidey.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary    0         49157      Y       15666
Brick deadpool.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary  0         49156      Y       2542
Brick groot.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary     0         49156      Y       2180
Self-heal Daemon on localhost                                      N/A       N/A        N       N/A  << Brick process is not running on any node.
Self-heal Daemon on spidey.ib.runlevelone.lan                      N/A       N/A        N       N/A
Self-heal Daemon on groot.ib.runlevelone.lan                       N/A       N/A        N       N/A

Task Status of Volume rhev_vms_primary
------------------------------------------------------------------------------
There are no active volume tasks


 3081  gluster volume start rhev_vms_noshards force
 3082  gluster volume status
 3083  gluster volume start rhev_vms_primary force
 3084  gluster volume status
 3085  gluster volume start rhev_vms_primary rhev_vms
 3086  gluster volume start rhev_vms_primary rhev_vms force

Status of volume: rhev_vms_primary
Gluster process                                                    TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick spidey.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary    0         49157      Y       15666
Brick deadpool.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary  0         49156      Y       2542
Brick groot.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary     0         49156      Y       2180
Self-heal Daemon on localhost                                      N/A       N/A        Y       8343
Self-heal Daemon on spidey.ib.runlevelone.lan                      N/A       N/A        Y       22381
Self-heal Daemon on groot.ib.runlevelone.lan                       N/A       N/A        Y       20633

Finally..

Dan




On Tue, May 29, 2018 at 8:47 PM, Dan Lavu <dan at redhat.com> wrote:

> Stefan,
>
> Sounds like a brick process is not running. I have noticed some strangeness
> in my lab when using RDMA: I often have to forcibly restart the brick
> process, and by "often" I mean every single time I do a major operation
> (add a new volume, remove a volume, stop a volume, etc.).
>
> gluster volume status <vol>
>
> Do any of the self-heal daemons show N/A? If that's the case, try
> forcing a restart of the volume.
>
> gluster volume start <vol> force
>
> This would also explain why your volumes aren't being replicated properly.
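>
> Once the self-heal daemons are back (and assuming the volume is actually
> replicated), something like
>
> gluster volume heal <vol> info
>
> should show the per-brick pending-heal counts draining as it catches up.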
>
> On Tue, May 29, 2018 at 5:20 PM, Stefan Solbrig <stefan.solbrig at ur.de>
> wrote:
>
>> Dear all,
>>
>> I faced a problem with a glusterfs volume (pure distributed, _not_
>> dispersed) over RDMA transport.  One user has a directory with a large
>> number of files (50,000 files), and just doing an "ls" in this directory
>> yields a "Transport endpoint is not connected" error. The effect is that
>> "ls" only shows some of the files, but not all.
>>
>> The respective log file shows this error message:
>>
>> [2018-05-20 20:38:25.114978] W [MSGID: 114031]
>> [client-rpc-fops.c:2578:client3_3_readdirp_cbk] 0-glurch-client-0:
>> remote operation failed [Transport endpoint is not connected]
>> [2018-05-20 20:38:27.732796] W [MSGID: 103046]
>> [rdma.c:4089:gf_rdma_process_recv] 0-rpc-transport/rdma: peer (
>> 10.100.245.18:49153), couldn't encode or decode the msg properly or
>> write chunks were not provided for replies that were bigger than
>> RDMA_INLINE_THRESHOLD (2048)
>> [2018-05-20 20:38:27.732844] W [MSGID: 114031]
>> [client-rpc-fops.c:2578:client3_3_readdirp_cbk] 0-glurch-client-3:
>> remote operation failed [Transport endpoint is not connected]
>> [2018-05-20 20:38:27.733181] W [fuse-bridge.c:2897:fuse_readdirp_cbk]
>> 0-glusterfs-fuse: 72882828: READDIRP => -1 (Transport endpoint is not
>> connected)
>>
>> I already set the memlock limit for glusterd to unlimited, but the
>> problem persists.
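>>
>> (On a systemd-based system that typically means a drop-in along these
>> lines, plus a daemon-reload and a glusterd restart; the exact unit name
>> and path may vary:
>>
>>   # /etc/systemd/system/glusterd.service.d/99-memlock.conf
>>   [Service]
>>   LimitMEMLOCK=infinity
>> )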
>>
>> Only going from RDMA transport to TCP transport solved the problem.  (I'm
>> running the volume now in mixed mode, config.transport=tcp,rdma.)  Mounting
>> with transport=rdma shows this error; mounting with transport=tcp is fine.
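>>
>> (For concreteness, the mounts are the usual FUSE mounts, sketched here
>> with placeholders; the volume name "glurch" is guessed from the client
>> log prefix above:
>>
>>   mount -t glusterfs -o transport=rdma <server>:/glurch /mnt/glurch
>>   mount -t glusterfs -o transport=tcp  <server>:/glurch /mnt/glurch
>>
>> and the mixed transport came from something like
>>
>>   gluster volume set glurch config.transport tcp,rdma
>> )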
>>
>> However, this problem does not arise on all large directories, only on
>> some. I haven't recognized a pattern yet.
>>
>> I'm using glusterfs v3.12.6 on the servers, with QDR InfiniBand HCAs.
>>
>> Is this a known issue with RDMA transport?
>>
>> best wishes,
>> Stefan
>>
>>
>
>