[Gluster-users] Issues with glustershd with release 8.4 and 9.1

> Hi,
> I am having significant issues with glustershd with releases 8.4 and 9.1.
> My oVirt clusters are using gluster storage backends, and were running
> fine with Gluster 7.x (shipped with earlier versions of oVirt Node 4.4.x).
> Recently the oVirt project moved to Gluster 8.4 for the nodes, and hence I
> have moved to this release when upgrading my clusters.
> Since then I am having issues whenever one of the nodes is brought down;
> when the nodes come back up online the bricks are typically back up and
> working, but some (random) glustershd processes in the various nodes seem
> to have issues connecting to some of them.
When the issue happens, can you check whether the TCP port of each brick
(glusterfsd) process displayed in `gluster volume status` matches the port
the process is actually using (i.e. the --brick-port argument you see when
you run `ps aux | grep glusterfsd`)? If they don't match, then glusterd has
incorrect brick port information in its memory and is serving it to
glustershd. Restarting glusterd, instead of killing the bricks and running
`volume start force`, should fix it, although we still need to find out why
glusterd serves incorrect port numbers.
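
If it helps, the comparison can be scripted roughly as below. This is only a
sketch under some assumptions: the helper names (ports_from_status,
ports_from_ps, compare_ports, check_brick_ports) are made up for this mail,
and the parsing assumes the usual unwrapped table layout of `gluster volume
status` (Brick, TCP Port, RDMA Port, Online, Pid). Long brick paths can wrap
onto a second line, in which case the awk filter would need adjusting.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from Gluster itself.

# Reads `gluster volume status` output on stdin and prints "pid port" for
# each brick line. Assumes: Brick <host:path> <tcp> <rdma> <online> <pid>.
ports_from_status() {
    awk '$1 == "Brick" && NF == 6 { print $6, $3 }'
}

# Reads `ps aux` output on stdin and prints "pid port" for each process
# that carries a --brick-port argument (either "--brick-port N" or
# "--brick-port=N").
ports_from_ps() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == "--brick-port" && i < NF) {
                print $2, $(i + 1)
            } else if ($i ~ /^--brick-port=/) {
                port = $i
                sub(/^--brick-port=/, "", port)
                print $2, port
            }
        }
    }'
}

# $1: file with "pid port" as reported by glusterd
# $2: file with "pid port" as seen in the process list
# Prints one line per mismatch and returns non-zero if any was found.
compare_ports() {
    awk 'NR == FNR { advertised[$1] = $2; next }
         ($1 in advertised) && advertised[$1] != $2 {
             printf "pid %s: glusterd reports %s, brick uses %s\n", $1, advertised[$1], $2
             bad = 1
         }
         END { exit bad + 0 }' "$1" "$2"
}

# Runs the comparison for the bricks on the local node.
check_brick_ports() {
    tmp=$(mktemp -d) || return 1
    gluster volume status | ports_from_status > "$tmp/advertised"
    ps aux | ports_from_ps > "$tmp/actual"
    compare_ports "$tmp/advertised" "$tmp/actual"
    rc=$?
    rm -rf "$tmp"
    return $rc
}
```

Running check_brick_ports on each node after the problem appears should print
nothing when the ports agree. Only local bricks are compared, since `ps` does
not see the brick processes on the other nodes.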

If they do match, then can you take a statedump of glustershd to check
whether it is indeed disconnected from the bricks? In the statedump, verify
that 'connected=1' is set for each brick. See "Self-heal is stuck/ not getting
completed." section in
A statedump can be taken with `kill -SIGUSR1 $pid-of-glustershd`. It will be
generated in the /var/run/gluster/ directory.
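
As a rough sketch of the statedump check, something like the following could
be used. The function names are hypothetical, and two things are assumptions
on my part: that the dump file is named glusterdump.<pid>.dump.<timestamp>,
and that each client section of the dump starts with a "[...]" header line
followed by a "connected=" line. Please adjust if your dumps look different.

```shell
#!/bin/sh
# Hypothetical helpers; the names are not part of Gluster.

# Reads a statedump on stdin and prints the header of every section that
# contains "connected=0", i.e. a client that is not connected to its brick.
disconnected_clients() {
    awk '/^\[/ { section = $0 }
         /^connected *= *0/ { print section }'
}

# $1: pid of glustershd (for example the Pid shown for the Self-heal Daemon
#     in `gluster volume status`).
# Triggers a statedump and lists the disconnected clients found in it.
shd_statedump_check() {
    pid=$1
    [ -n "$pid" ] || { echo "usage: shd_statedump_check PID" >&2; return 2; }
    kill -USR1 "$pid" || return 1
    sleep 2   # give the process a moment to write the dump
    # Assumed file name pattern; pick the most recent dump for this pid.
    dump=$(ls -t /var/run/gluster/glusterdump."$pid".dump.* 2>/dev/null | head -n 1)
    if [ -z "$dump" ]; then
        echo "no statedump found for pid $pid" >&2
        return 1
    fi
    disconnected_clients < "$dump"
}
```

If shd_statedump_check prints nothing, every client in the dump reported
'connected=1'. Any section it prints names a brick connection that
glustershd considers down.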

