<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr">Hi Ravi,<div><br></div><div>thanks a million for your reply.</div><div><br></div><div>I have replicated the issue in my test cluster by bringing one of the nodes down, and then up again.</div><div>The glustershd process in the restarted node is now complaining about connectivity to two bricks in one of my volumes:</div><div><br></div><div>---</div><div><font face="monospace">[2021-05-19 14:05:14.462133 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-11: changing port to 49170 (from 0)<br>[2021-05-19 14:05:14.464971 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}] <br>[2021-05-19 14:05:14.465209 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-11: failed to set the volume [{errno=2}, {error=No such file or directory}] <br>[2021-05-19 14:05:14.465236 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-11: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}] <br>[2021-05-19 14:05:14.465248 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-11: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}] <br>[2021-05-19 14:05:14.465256 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-11: sending CHILD_CONNECTING event [] <br>[2021-05-19 14:05:14.465291 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-11: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-11}] <br>[2021-05-19 14:05:14.473598 +0000] I [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-20: changing port to 49173 (from 
0)<br>[2021-05-19 14:05:14.476543 +0000] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-VM_Storage_1-client-20: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}] <br>[2021-05-19 14:05:14.476764 +0000] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-20: failed to set the volume [{errno=2}, {error=No such file or directory}] <br>[2021-05-19 14:05:14.476785 +0000] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-20: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}] <br>[2021-05-19 14:05:14.476799 +0000] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-20: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}] <br>[2021-05-19 14:05:14.476812 +0000] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-20: sending CHILD_CONNECTING event [] <br>[2021-05-19 14:05:14.476849 +0000] I [MSGID: 114018] [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-20: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=VM_Storage_1-client-20}] </font><br></div></div></div></div><div>---</div><div><br></div><div>The two bricks are the following:</div><div><font face="monospace">


VM_Storage_1-client-20 --> Brick21: lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick (arbiter)<br></font></div><div><font face="monospace">VM_Storage_1-client-11 --> Brick12: lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick (arbiter)<br></font></div><div>(In this case the issue affects two arbiter bricks, but that is not always the case.)</div><div><br></div><div>The port information from "gluster volume status VM_Storage_1" on the affected node (the same node running the glustershd that reports the issue) is:</div><div><font face="monospace">Brick lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick                                   <b>49172     </b>0          Y       3978256</font><br></div><div><font face="monospace">Brick lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick                                   <b>49169     </b>0          Y       3978224<br></font></div><div><br></div><div>This matches the actual port of each process, whereas the glustershd log above shows it trying ports 49173 and 49170 respectively:</div><div><font face="monospace">root     3978256  1.5  0.0 1999568 30372 ?       Ssl  May18  15:56 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b5_arb-brick -p /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b5_arb-brick.pid -S /var/run/gluster/2b1dd3ca06d39a59.socket --brick-name /bricks/vm_b5_arb/brick -l /var/log/glusterfs/bricks/bricks-vm_b5_arb-brick.log --xlator-option *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name brick --brick-port <b>49172 </b>--xlator-option VM_Storage_1-server.listen-port=<b>49172</b></font><br></div><div><font face="monospace">root     3978224  4.3  0.0 1867976 27928 ?       
Ssl  May18  44:55 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b3_arb-brick -p /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b3_arb-brick.pid -S /var/run/gluster/00d461b7d79badc9.socket --brick-name /bricks/vm_b3_arb/brick -l /var/log/glusterfs/bricks/bricks-vm_b3_arb-brick.log --xlator-option *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name brick --brick-port <b>49169 </b>--xlator-option VM_Storage_1-server.listen-port=<b>49169</b><br></font></div><div><br></div><div>So the issue seems to be specific to glustershd, as the <b>glusterd process seems to be aware of the right port </b>(it matches the real port, and the brick is indeed up according to the status).</div><div><br></div><div>I then took a statedump as you suggested, and glustershd is indeed not connected to these bricks:</div><div><br></div><div><font face="monospace">[xlator.protocol.client.VM_Storage_1-client-11.priv]<br><b>connected=0</b><br>total_bytes_read=341120<br>ping_timeout=42<br>total_bytes_written=594008<br>ping_msgs_sent=0<br>msgs_sent=0<br></font></div><div><font face="monospace"><br></font></div><div><font face="monospace">[xlator.protocol.client.VM_Storage_1-client-20.priv]<br><b>connected=0</b><br>total_bytes_read=341120<br>ping_timeout=42<br>total_bytes_written=594008<br>ping_msgs_sent=0<br>msgs_sent=0</font><br></div><div><br></div><div>The other important thing to note is that the bricks that fail to connect are normally all on the same (remote) node; in this case both are on node 3. 
That always seems to be the case: I have not encountered a scenario where bricks from different nodes report this issue (at least for the same volume).</div><div><br></div><div>Please let me know if you need any additional info.</div><div><br></div><div>Regards,</div><div>Marco</div><div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 19 May 2021 at 06:31, Ravishankar N &lt;<a href="mailto:ravishankar@redhat.com">ravishankar@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div style="font-size:small"><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, May 17, 2021 at 4:22 PM Marco Fais &lt;<a href="mailto:evilmf@gmail.com" target="_blank">evilmf@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi,<div><br></div><div>I am having significant issues with glustershd with releases 8.4 and 9.1.</div><div><br></div><div>My oVirt clusters use Gluster storage backends and were running fine with Gluster 7.x (shipped with earlier versions of oVirt Node 4.4.x). 
Recently the oVirt project moved to Gluster 8.4 for the nodes, so I moved to this release when upgrading my clusters.</div><div><br></div><div>Since then I have been having issues whenever one of the nodes is brought down; when the node comes back online the bricks are typically back up and working, but some (random) glustershd processes on the various nodes seem to have trouble connecting to some of them.</div><div><br></div></div></blockquote><div><span class="gmail_default" style="font-size:small"><br></span></div><div><span class="gmail_default" style="font-size:small">When the issue happens, can you check whether the TCP port number of each brick (glusterfsd) process displayed in `gluster volume status` matches the actual port number (i.e. the --brick-port argument) shown when you run `ps aux | grep glusterfsd`? If they don't match, then glusterd has incorrect brick port information in its memory and is serving it to glustershd. Restarting glusterd instead of (killing the bricks + `volume start force`) should fix it, although we need to find out why glusterd serves incorrect port numbers. </span></div><div><br></div><div>If they do match, then <span class="gmail_default" style="font-size:small">can you </span>take a statedump of glustershd to check that it is indeed disconnected from the bricks<span class="gmail_default" style="font-size:small">? You will need to check whether the statedump shows 'connected=1' for each brick. See the "Self-heal is stuck/ not getting completed." section in <a href="https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/" target="_blank">https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/</a>. A statedump can be taken with `kill -SIGUSR1 $pid-of-glustershd`. 
It will be generated in the /var/run/gluster/ directory.</span></div><div><br></div><div>Regards,<br></div><div><span class="gmail_default" style="font-size:small">Ravi </span></div><div><span class="gmail_default" style="font-size:small"><br></span></div><div><span class="gmail_default" style="font-size:small"><br></span></div><div><span class="gmail_default" style="font-size:small"></span></div></div></div>

</blockquote></div></div>
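<div><br></div><div>For reference, the statedump check discussed in this thread could be scripted roughly as follows. This is only a sketch: the function name `disconnected_clients` and the sample text are made up for illustration, and it assumes the protocol/client sections of the statedump look like the excerpt quoted earlier in this message (a `[xlator.protocol.client.NAME.priv]` header followed by `key=value` lines).</div>
<pre>
```python
import re

# Matches section headers such as
# [xlator.protocol.client.VM_Storage_1-client-11.priv]
HEADER = re.compile(r"^\[xlator\.protocol\.client\.(.+)\.priv\]$")


def disconnected_clients(statedump_text):
    """Return the client xlator names whose .priv section has connected=0."""
    current = None  # name of the client section being read, if any
    down = []
    for raw in statedump_text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = match.group(1)
        elif line.startswith("["):
            # Some other statedump section: stop attributing lines to a client.
            current = None
        elif current is not None and line == "connected=0":
            down.append(current)
    return down


# Illustrative input modelled on the excerpt in this email (assumption:
# a real statedump contains many more sections and keys).
sample = """\
[xlator.protocol.client.VM_Storage_1-client-11.priv]
connected=0
ping_timeout=42

[xlator.protocol.client.VM_Storage_1-client-12.priv]
connected=1
ping_timeout=42

[xlator.protocol.client.VM_Storage_1-client-20.priv]
connected=0
ping_timeout=42
"""

if __name__ == "__main__":
    for name in disconnected_clients(sample):
        print(name)
```
</pre>
<div>To use it on a real dump, the contents of the file generated in /var/run/gluster/ would be read and passed in as `statedump_text`; the client names it returns can then be mapped to bricks as in the list above.</div>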