<div dir="ltr">Forgot to mention, sometimes I have to do force start other volumes as well, its hard to determine which brick process is locked up from the logs. <div><br></div><div><div><br></div><div>Status of volume: rhev_vms_primary</div><div>Gluster process TCP Port RDMA Port Online Pid</div><div>------------------------------------------------------------------------------</div><div>Brick spidey.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary 0 49157 Y 15666</div><div>Brick deadpool.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary 0 49156 Y 2542 </div><div>Brick groot.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary 0 49156 Y 2180 </div><div>Self-heal Daemon on localhost N/A N/A N N/A << Brick process is not running on any node.</div><div>Self-heal Daemon on spidey.ib.runlevelone.lan N/A N/A N N/A </div><div>Self-heal Daemon on groot.ib.runlevelone.lan N/A N/A N N/A </div><div> </div><div>Task Status of Volume rhev_vms_primary</div><div>------------------------------------------------------------------------------</div><div>There are no active volume tasks</div><div><br></div><div><br></div><div><div> 3081 gluster volume start rhev_vms_noshards force</div><div> 3082 gluster volume status</div><div> 3083 gluster volume start rhev_vms_primary force</div><div> 3084 gluster volume status</div><div> 3085 gluster volume start rhev_vms_primary rhev_vms</div><div> 3086 gluster volume start rhev_vms_primary rhev_vms force</div></div><div><br></div></div><div><div>Status of volume: rhev_vms_primary</div><div>Gluster process TCP Port RDMA Port Online Pid</div><div>------------------------------------------------------------------------------</div><div>Brick spidey.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary 0 49157 Y 15666</div><div>Brick deadpool.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary 0 49156 Y 2542 </div><div>Brick groot.ib.runlevelone.lan:/gluster/brick/rhev_vms_primary 0 49156 Y 2180 </div><div>Self-heal Daemon on localhost N/A N/A Y 8343 </div><div>Self-heal Daemon on spidey.ib.runlevelone.lan N/A N/A Y 22381</div><div>Self-heal Daemon on groot.ib.runlevelone.lan N/A N/A Y 20633</div></div><div><br></div><div>Finally..</div><div><br></div><div>Dan</div><div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, May 29, 2018 at 8:47 PM, Dan Lavu <span dir="ltr"><<a href="mailto:dan@redhat.com" target="_blank">dan@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Stefan, <div><br></div><div>Sounds like a brick process is not running. I have notice some strangeness in my lab when using RDMA, I often have to forcibly restart the brick process, often as in every single time I do a major operation, add a new volume, remove a volume, stop a volume, etc.</div><div><br></div><div>gluster volume status <vol> </div><div><br></div><div>Does any of the self heal daemons show N/A? If that's the case, try forcing a restart on the volume. </div><div><br></div><div>gluster volume start <vol> force</div><div><br></div><div>This will also explain why your volumes aren't being replicated properly. 

On Tue, May 29, 2018 at 5:20 PM, Stefan Solbrig <stefan.solbrig@ur.de> wrote:

Dear all,

I faced a problem with a glusterfs volume (pure distributed, _not_ dispersed) over RDMA transport. One user had a directory with a large number of files (50,000 files), and just doing an "ls" in this directory yields a "Transport endpoint is not connected" error. The effect is that "ls" shows only some of the files, not all of them.

The respective log file shows these error messages:

[2018-05-20 20:38:25.114978] W [MSGID: 114031] [client-rpc-fops.c:2578:client3_3_readdirp_cbk] 0-glurch-client-0: remote operation failed [Transport endpoint is not connected]
[2018-05-20 20:38:27.732796] W [MSGID: 103046] [rdma.c:4089:gf_rdma_process_recv] 0-rpc-transport/rdma: peer (10.100.245.18:49153), couldn't encode or decode the msg properly or write chunks were not provided for replies that were bigger than RDMA_INLINE_THRESHOLD (2048)
[2018-05-20 20:38:27.732844] W [MSGID: 114031] [client-rpc-fops.c:2578:client3_3_readdirp_cbk] 0-glurch-client-3: remote operation failed [Transport endpoint is not connected]
[2018-05-20 20:38:27.733181] W [fuse-bridge.c:2897:fuse_readdirp_cbk] 0-glusterfs-fuse: 72882828: READDIRP => -1 (Transport endpoint is not connected)

I already set the memlock limit for glusterd to unlimited, but the problem persists.
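For reference, a minimal sketch of one way to raise that limit persistently, assuming glusterd runs under systemd as glusterd.service (unit name and paths may differ per distribution):

    # drop-in that removes the locked-memory limit for glusterd
    mkdir -p /etc/systemd/system/glusterd.service.d
    printf '[Service]\nLimitMEMLOCK=infinity\n' > /etc/systemd/system/glusterd.service.d/memlock.conf
    systemctl daemon-reload
    systemctl restart glusterd

    # verify the limit of the running daemon
    grep 'locked memory' /proc/$(pidof glusterd)/limits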

Only switching from RDMA transport to TCP transport solved the problem. (I'm now running the volume in mixed mode, config.transport=tcp,rdma.) Mounting with transport=rdma shows this error; mounting with transport=tcp is fine.
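For context, the mixed-mode setup and the two mount variants amount to roughly the following. This is a sketch only: the volume name glurch is taken from the log lines above, "server" stands for any of the gluster nodes, and changing config.transport may require the volume to be stopped first:

    gluster volume stop glurch
    gluster volume set glurch config.transport tcp,rdma
    gluster volume start glurch

    # mount over RDMA -- this is the case that shows the READDIRP errors
    mount -t glusterfs -o transport=rdma server:/glurch /mnt/glurch

    # mount over TCP -- this works fine
    mount -t glusterfs -o transport=tcp server:/glurch /mnt/glurch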

However, this problem does not arise on all large directories, only on some; I haven't recognized a pattern yet.

I'm using glusterfs v3.12.6 on the servers, with QDR InfiniBand HCAs.

Is this a known issue with RDMA transport?

best wishes,
Stefan

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users