[Gluster-users] Issues with glustershd with release 8.4 and 9.1

Wed May 19 05:30:52 UTC 2021

On Mon, May 17, 2021 at 4:22 PM Marco Fais <evilmf at gmail.com> wrote:

> Hi,
>
> I am having significant issues with glustershd with releases 8.4 and 9.1.
>
> My oVirt clusters are using gluster storage backends, and were running
> fine with Gluster 7.x (shipped with earlier versions of oVirt Node 4.4.x).
> Recently the oVirt project moved to Gluster 8.4 for the nodes, and hence I
> have moved to this release when upgrading my clusters.
>
> Since then I am having issues whenever one of the nodes is brought down;
> when the nodes come back up online the bricks are typically back up and
> working, but some (random) glustershd processes in the various nodes seem
> to have issues connecting to some of them.
>
>
When the issue happens, can you check whether the TCP port of each brick
(glusterfsd) process displayed in `gluster volume status` matches the port
the process is actually using (i.e. the --brick-port argument) when you
run `ps aux | grep glusterfsd`? If they don't match, then glusterd has
incorrect brick port information in its memory and is serving it to
glustershd. Restarting glusterd, instead of killing the bricks and running
`volume start force`, should fix it, although we still need to find out why
glusterd serves incorrect port numbers.
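
The port comparison can be scripted if there are many bricks. The following
is only a rough sketch, not a tested tool: it assumes the usual column layout
of `gluster volume status` (Brick, TCP Port, RDMA Port, Online, Pid, with long
brick names wrapping onto a second line) and a `ps aux` layout where the
second column is the PID. The function names and the `local_host` parameter
are made up for illustration. Since `ps` only sees local processes and PIDs
are only unique per node, it compares bricks of one node at a time.

```python
import re


def ports_from_status(status_text, local_host):
    """Map PID -> TCP port for local_host's bricks, per `gluster volume status`."""
    ports = {}
    pending = False   # True when a brick name wrapped and its columns follow
    is_local = False
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "Brick":
            is_local = fields[1].startswith(local_host + ":")
            row = fields[2:]          # columns after "Brick <host:path>"
            pending = not row
        elif pending and fields:
            row, pending = fields, False
        else:
            pending = False
            continue
        # Expected columns: TCP port, RDMA port, Online, Pid (assumed layout).
        # Offline bricks show N/A and are skipped.
        if is_local and len(row) == 4 and row[0].isdigit() and row[3].isdigit():
            ports[int(row[3])] = int(row[0])
    return ports


def ports_from_ps(ps_text):
    """Map PID -> --brick-port value for glusterfsd lines of `ps aux`."""
    ports = {}
    for line in ps_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        match = re.search(r"--brick-port[= ](\d+)", line)
        if match and "glusterfsd" in line:
            ports[int(fields[1])] = int(match.group(1))
    return ports


def mismatched_bricks(status_text, ps_text, local_host):
    """Return {pid: (port glusterd reports, port the brick was started with)}."""
    reported = ports_from_status(status_text, local_host)
    actual = ports_from_ps(ps_text)
    return {
        pid: (reported[pid], actual[pid])
        for pid in reported.keys() & actual.keys()
        if reported[pid] != actual[pid]
    }
```

Feed it the captured output of the two commands on each node, with
`local_host` set to that node's name exactly as it appears in the Brick
lines. A non-empty result would point to glusterd holding stale port
information for those bricks.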

If they do match, then can you take a statedump of glustershd to check
whether it is indeed disconnected from the bricks? You will need to verify
that the statedump shows 'connected=1'. See the "Self-heal is stuck/ not
getting completed." section in
https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/.
A statedump can be taken with `kill -SIGUSR1 $pid-of-glustershd`. It will
be generated in the /var/run/gluster/ directory.
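
To scan the statedump for disconnected entries, something like the sketch
below might help. It assumes the dump is laid out as `[section]` headers
followed by `key=value` lines, and that the `connected` key appears under
each client section; the function names are hypothetical, and the layout
should be checked against an actual dump before relying on it.

```python
def connection_states(statedump_text):
    """Map each section name to the value of its `connected` key, if present."""
    states = {}
    section = None
    for line in statedump_text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]              # remember the current section
        elif line.startswith("connected=") and section is not None:
            states[section] = line.split("=", 1)[1]
    return states


def disconnected(statedump_text):
    """Return the sections whose `connected` value is anything other than 1."""
    states = connection_states(statedump_text)
    return sorted(name for name, value in states.items() if value != "1")
```

Read the dump file from /var/run/gluster/ into a string and pass it to
`disconnected()`; any section it returns is one where the dump does not
show 'connected=1'.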

Regards,
Ravi