<div dir="ltr">Thanks again, <div>I have tried to run a find over the cluster to try and trigger self-healing, but it's very slow so I don't have it running right now. </div><div>If I check the same "ls /brick/folder" on all bricks, it takes less than 0.01 sec so I don't think any individual brick is causing the problem, performance on each brick seems to be normal. </div><div>I think the issue is somewhere in the gluster internal communication as I believe FUSE mounted clients will try to communicate with all bricks. Unfortunately, I am not sure how to confirm this or narrow this down. <br></div><div>Really struggling with this one now, it's starting to significantly impact our operations. I'm not sure what else I can try so appreciate any suggestions. </div><div><br></div><div>Thank you, </div><div>- Patrick</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Apr 21, 2019 at 11:50 PM Strahil <<a href="mailto:hunter86_bg@yahoo.com">hunter86_bg@yahoo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><p dir="ltr">Usually when this happens I run '/find /fuse/mount/point -exec stat {} \;' from a client (using gluster with oVirt).<br>
Thank you,
- Patrick

On Sun, Apr 21, 2019 at 11:50 PM Strahil <hunter86_bg@yahoo.com> wrote:

Usually when this happens I run 'find /fuse/mount/point -exec stat {} \;' from a client (I'm using gluster with oVirt). My scale is many times smaller, though, so I don't know how this will affect you, other than triggering a heal.
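If a full walk is too slow, a slightly lighter variant of the same idea might help (untested at your scale; the paths are placeholders):

# Batch many files per stat invocation instead of one process per file,
# and walk one subtree at a time so the walk can be stopped and resumed.
find /fuse/mount/point/subdir1 -exec stat {} + > /dev/null

Each stat through the FUSE mount still forces a lookup on the replicas, which is what should trigger the self-heal check.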
<p dir="ltr">So the round-robin of the DNS clarifies the mystery .In such case, maybe FUSE client is not the problem.Still it is worth trying a VM with the new gluster version to mount the cluster.</p>
<p dir="ltr">From the profile (took a short glance over it from my phone), not all bricks are spending much of their time in LOOKUP.<br>
Maybe your data is not evenly distributed? Is that ever possible ?<br>
Sadly you can't rebalance untill all those heals are pending.(Maybe I'm wrong)</p>
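A quick way to check for skew (brick paths are placeholders; run on each storage node):

# Compare used space across the bricks of the volume:
df -h /path/to/brick
# And check whether a rebalance is already running or was left half-done:
gluster volume rebalance gvAA01 status

If one brick holds far more data than its peers, its LOOKUP load would also be higher.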
<p dir="ltr">Have you checked the speed of 'ls /my/brick/subdir1/' on each brick ?</p>
<p dir="ltr">Sadly, I'm just a gluster user, so take everything with a grain of salt.</p>
<p dir="ltr">Best Regards,<br>
Strahil Nikolov</p>
<div class="gmail-m_6369538650921111769quote">On Apr 21, 2019 18:03, Patrick Rennie <<a href="mailto:patrickmrennie@gmail.com" target="_blank">patrickmrennie@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail-m_6369538650921111769quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I just tried to check my "gluster volume heal gvAA01 statistics" and it doesn't seem like a full heal was still in progress, just an index, I have started the full heal again and am trying to monitor it with "gluster volume heal gvAA01 info" which just shows me thousands of gfid file identifiers scrolling past. <div>What is the best way to check the status of a heal and track the files healed and progress to completion? </div><div><br></div><div>Thank you,</div><div>- Patrick</div></div><br><div class="gmail-m_6369538650921111769elided-text"><div dir="ltr">On Sun, Apr 21, 2019 at 10:28 PM Patrick Rennie <<a href="mailto:patrickmrennie@gmail.com" target="_blank">patrickmrennie@gmail.com</a>> wrote:<br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I think just worked out why NFS lookups are sometimes slow and sometimes fast as the hostname uses round robin DNS lookups, if I change to a specific host, 01-B, it's always quick, and if I change to the other brick host, 02-B, it's always slow. <div>Maybe that will help to narrow this down? </div></div><br><div class="gmail-m_6369538650921111769elided-text"><div dir="ltr">On Sun, Apr 21, 2019 at 10:24 PM Patrick Rennie <<a href="mailto:patrickmrennie@gmail.com" target="_blank">patrickmrennie@gmail.com</a>> wrote:<br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div>Hi Strahil, </div><div><br></div><div>Thank you for your reply and your suggestions. I'm not sure which logs would be most relevant to be checking to diagnose this issue, we have the brick logs, the cluster mount logs, the shd logs or something else? I have posted a few that I have seen repeated a few times already. I will continue to post anything further that I see. </div><div>I am working on migrating data to some new storage, so this will slowly free up space, although this is a production cluster and new data is being uploaded every day, sometimes faster than I can migrate it off. I have several other similar clusters and none of them have the same problem, one the others is actually at 98-99% right now (big problem, I know) but still performs perfectly fine compared to this cluster, I am not sure low space is the root cause here. </div><div><br></div><div>I currently have 13 VMs accessing this cluster, I have checked each one and all of them use one of the two options below to mount the cluster in fstab</div><div><br></div><div>HOSTNAME:/gvAA01 /mountpoint glusterfs defaults,_netdev,rw,log-level=WARNING,direct-io-mode=disable,use-readdirp=no 0 0</div><div>HOSTNAME:/gvAA01 /mountpoint glusterfs defaults,_netdev,rw,log-level=WARNING,direct-io-mode=disable</div><div><br></div><div>I also have a few other VMs which use NFS to access the cluster, and these machines appear to be significantly quicker, initially I get a similar delay with NFS but if I cancel the first "ls" and try it again I get < 1 sec lookups, this can take over 10 minutes by FUSE/gluster client, but the same trick of cancelling and trying again doesn't work for FUSE/gluster. 
Sometimes the NFS queries have no delay at all, which is a bit strange to me.

HOSTNAME:/gvAA01 /mountpoint/ nfs defaults,_netdev,vers=3,async,noatime 0 0

Example:

user@VM:~$ time ls /cluster/folder
^C

real    9m49.383s
user    0m0.001s
sys     0m0.010s

user@VM:~$ time ls /cluster/folder
<results>

real    0m0.069s
user    0m0.001s
sys     0m0.007s

---

I have checked the profiling as you suggested; I let it run for around a minute, then cancelled the ls and saved the profile info.

root@HOSTNAME:/var/log/glusterfs# gluster volume profile gvAA01 start
Starting volume profile on gvAA01 has been successful
root@HOSTNAME:/var/log/glusterfs# time ls /cluster/folder
^C

real    1m1.660s
user    0m0.000s
sys     0m0.002s

root@HOSTNAME:/var/log/glusterfs# gluster volume profile gvAA01 info >> ~/profile.txt
root@HOSTNAME:/var/log/glusterfs# gluster volume profile gvAA01 stop

I will attach the results to this email as it's o
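P.S. On my earlier question about tracking heal progress, these look like the closest thing to a progress indicator (the 'info summary' form apparently needs a newer gluster release than we run, so both are untested on this cluster):

# Per-brick count of entries still pending heal; watching it trend
# downwards over time is probably the most practical progress measure.
gluster volume heal gvAA01 statistics heal-count
# Newer releases also have a condensed per-brick summary:
gluster volume heal gvAA01 info summary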