[Gluster-users] Issues with glustershd with release 8.4 and 9.1

Thu May 27 17:39:41 UTC 2021

Hi Marco,

Thank you for opening the issue. I'll check the log contents and get back
to you.

On Thu, May 27, 2021 at 10:50 PM Marco Fais <evilmf at gmail.com> wrote:

> Srijan
>
> thanks a million -- I have opened the issue as requested here:
>
> https://github.com/gluster/glusterfs/issues/2492
>
> I have attached the glusterd.log and glustershd.log files, but please let
> me know if there is any other test I should do or logs I should provide.
>
>
> Thanks,
> Marco
>
>
> On Wed, 26 May 2021 at 18:09, Srijan Sivakumar <ssivakum at redhat.com>
> wrote:
>
>> Hi Marco,
>>
>> If possible, let's open an issue in github and track this from there. I
>> am checking the previous mails in the chain to see if I can infer something
>> about the situation. It would be helpful if we could analyze this with the
>> help of log files. Especially glusterd.log and glustershd.log.
>>
>> To open an issue, you can use this link : Open a new issue
>> <https://github.com/gluster/glusterfs/issues/new>
>>
>> On Wed, May 26, 2021 at 5:02 PM Marco Fais <evilmf at gmail.com> wrote:
>>
>>> Ravi,
>>>
>>> thanks a million.
>>> @Mohit, @Srijan please let me know if you need any additional
>>> information.
>>>
>>> Thanks,
>>> Marco
>>>
>>>
>>> On Tue, 25 May 2021 at 17:28, Ravishankar N <ravishankar at redhat.com>
>>> wrote:
>>>
>>>> Hi Marco,
>>>> I haven't had any luck yet.  Adding Mohit and Srijan who work in
>>>> glusterd in case they have some inputs.
>>>> -Ravi
>>>>
>>>>
>>>> On Tue, May 25, 2021 at 9:31 PM Marco Fais <evilmf at gmail.com> wrote:
>>>>
>>>>> Hi Ravi
>>>>>
>>>>> just wondering if you have any further thoughts on this --
>>>>> unfortunately it is something still very much affecting us at the moment.
>>>>> I am trying to understand how to troubleshoot it further but haven't
>>>>> been able to make much progress...
>>>>>
>>>>> Thanks,
>>>>> Marco
>>>>>
>>>>>
>>>>> On Thu, 20 May 2021 at 19:04, Marco Fais <evilmf at gmail.com> wrote:
>>>>>
>>>>>> Just to complete...
>>>>>>
>>>>>> from the FUSE mount log on server 2 I see the same errors as in
>>>>>> glustershd.log on node 1:
>>>>>>
>>>>>> [2021-05-20 17:58:34.157971 +0000] I [MSGID: 114020]
>>>>>> [client.c:2319:notify] 0-VM_Storage_1-client-11: parent translators are
>>>>>> ready, attempting connect on transport []
>>>>>> [2021-05-20 17:58:34.160586 +0000] I
>>>>>> [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-11: changing port
>>>>>> to 49170 (from 0)
>>>>>> [2021-05-20 17:58:34.160608 +0000] I [socket.c:849:__socket_shutdown]
>>>>>> 0-VM_Storage_1-client-11: intentional socket shutdown(20)
>>>>>> [2021-05-20 17:58:34.161403 +0000] I [MSGID: 114046]
>>>>>> [client-handshake.c:857:client_setvolume_cbk] 0-VM_Storage_1-client-10:
>>>>>> Connected, attached to remote volume [{conn-name=VM_Storage_1-client-10},
>>>>>> {remote_subvol=/bricks/vm_b3_vol/brick}]
>>>>>> [2021-05-20 17:58:34.161513 +0000] I [MSGID: 108002]
>>>>>> [afr-common.c:6435:afr_notify] 0-VM_Storage_1-replicate-3: Client-quorum is
>>>>>> met
>>>>>> [2021-05-20 17:58:34.162043 +0000] I [MSGID: 114020]
>>>>>> [client.c:2319:notify] 0-VM_Storage_1-client-13: parent translators are
>>>>>> ready, attempting connect on transport []
>>>>>> [2021-05-20 17:58:34.162491 +0000] I
>>>>>> [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-12: changing port
>>>>>> to 49170 (from 0)
>>>>>> [2021-05-20 17:58:34.162507 +0000] I [socket.c:849:__socket_shutdown]
>>>>>> 0-VM_Storage_1-client-12: intentional socket shutdown(26)
>>>>>> [2021-05-20 17:58:34.163076 +0000] I [MSGID: 114057]
>>>>>> [client-handshake.c:1128:select_server_supported_programs]
>>>>>> 0-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1},
>>>>>> {Num=1298437}, {Version=400}]
>>>>>> [2021-05-20 17:58:34.163339 +0000] W [MSGID: 114043]
>>>>>> [client-handshake.c:727:client_setvolume_cbk] 0-VM_Storage_1-client-11:
>>>>>> failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>>> [2021-05-20 17:58:34.163351 +0000] W [MSGID: 114007]
>>>>>> [client-handshake.c:752:client_setvolume_cbk] 0-VM_Storage_1-client-11:
>>>>>> failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid
>>>>>> argument}]
>>>>>> [2021-05-20 17:58:34.163360 +0000] E [MSGID: 114044]
>>>>>> [client-handshake.c:757:client_setvolume_cbk] 0-VM_Storage_1-client-11:
>>>>>> SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2},
>>>>>> {error=No such file or directory}]
>>>>>> [2021-05-20 17:58:34.163365 +0000] I [MSGID: 114051]
>>>>>> [client-handshake.c:879:client_setvolume_cbk] 0-VM_Storage_1-client-11:
>>>>>> sending CHILD_CONNECTING event []
>>>>>> [2021-05-20 17:58:34.163425 +0000] I [MSGID: 114018]
>>>>>> [client.c:2229:client_rpc_notify] 0-VM_Storage_1-client-11: disconnected
>>>>>> from client, process will keep trying to connect glusterd until brick's
>>>>>> port is available [{conn-name=VM_Storage_1-client-11}]
>>>>>>
>>>>>> On Thu, 20 May 2021 at 18:54, Marco Fais <evilmf at gmail.com> wrote:
>>>>>>
>>>>>>> HI Ravi,
>>>>>>>
>>>>>>> thanks again for your help.
>>>>>>>
>>>>>>> Here is the output of "cat
>>>>>>> graphs/active/VM_Storage_1-client-11/private" from the same node
>>>>>>> where glustershd is complaining:
>>>>>>>
>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>>>> fd.0.remote_fd = 1
>>>>>>> ------ = ------
>>>>>>> granted-posix-lock[0] = owner = 7904e87d91693fb7, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 100, l_len = 1
>>>>>>> granted-posix-lock[1] = owner = 7904e87d91693fb7, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 101, l_len = 1
>>>>>>> granted-posix-lock[2] = owner = 7904e87d91693fb7, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 103, l_len = 1
>>>>>>> granted-posix-lock[3] = owner = 7904e87d91693fb7, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 201, l_len = 1
>>>>>>> granted-posix-lock[4] = owner = 7904e87d91693fb7, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 203, l_len = 1
>>>>>>> ------ = ------
>>>>>>> fd.1.remote_fd = 0
>>>>>>> ------ = ------
>>>>>>> granted-posix-lock[0] = owner = b43238094746d9fe, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 100, l_len = 1
>>>>>>> granted-posix-lock[1] = owner = b43238094746d9fe, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 201, l_len = 1
>>>>>>> granted-posix-lock[2] = owner = b43238094746d9fe, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 203, l_len = 1
>>>>>>> ------ = ------
>>>>>>> fd.2.remote_fd = 3
>>>>>>> ------ = ------
>>>>>>> granted-posix-lock[0] = owner = 53526588c515153b, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 100, l_len = 1
>>>>>>> granted-posix-lock[1] = owner = 53526588c515153b, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 201, l_len = 1
>>>>>>> granted-posix-lock[2] = owner = 53526588c515153b, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 203, l_len = 1
>>>>>>> ------ = ------
>>>>>>> fd.3.remote_fd = 2
>>>>>>> ------ = ------
>>>>>>> granted-posix-lock[0] = owner = 889461581e4fda22, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 100, l_len = 1
>>>>>>> granted-posix-lock[1] = owner = 889461581e4fda22, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 101, l_len = 1
>>>>>>> granted-posix-lock[2] = owner = 889461581e4fda22, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 103, l_len = 1
>>>>>>> granted-posix-lock[3] = owner = 889461581e4fda22, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 201, l_len = 1
>>>>>>> granted-posix-lock[4] = owner = 889461581e4fda22, cmd = F_SETLK
>>>>>>> fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type =
>>>>>>> F_RDLCK, l_start = 203, l_len = 1
>>>>>>> ------ = ------
>>>>>>> connected = 1
>>>>>>> total_bytes_read = 6665235356
>>>>>>> ping_timeout = 42
>>>>>>> total_bytes_written = 4756303549
>>>>>>> ping_msgs_sent = 3662
>>>>>>> msgs_sent = 16786186
>>>>>>>
>>>>>>> So they seem to be connected there.
>>>>>>> *However* -- they are not connected apparently in server 2 (where I
>>>>>>> have just re-mounted the volume):
>>>>>>> [root at lab-cnvirt-h02 .meta]# cat
>>>>>>> graphs/active/VM_Storage_1-client-11/private
>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>>>> *connected = 0*
>>>>>>> total_bytes_read = 50020
>>>>>>> ping_timeout = 42
>>>>>>> total_bytes_written = 84628
>>>>>>> ping_msgs_sent = 0
>>>>>>> msgs_sent = 0
>>>>>>> [root at lab-cnvirt-h02 .meta]# cat
>>>>>>> graphs/active/VM_Storage_1-client-20/private
>>>>>>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>>>>>>> *connected = 0*
>>>>>>> total_bytes_read = 53300
>>>>>>> ping_timeout = 42
>>>>>>> total_bytes_written = 90180
>>>>>>> ping_msgs_sent = 0
>>>>>>> msgs_sent = 0
>>>>>>>
>>>>>>> The other bricks look connected...
>>>>>>>
>>>>>>> Regards,
>>>>>>> Marco
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 20 May 2021 at 14:02, Ravishankar N <ravishankar at redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Marco,
>>>>>>>>
>>>>>>>> On Wed, May 19, 2021 at 8:02 PM Marco Fais <evilmf at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Ravi,
>>>>>>>>>
>>>>>>>>> thanks a million for your reply.
>>>>>>>>>
>>>>>>>>> I have replicated the issue in my test cluster by bringing one of
>>>>>>>>> the nodes down, and then up again.
>>>>>>>>> The glustershd process in the restarted node is now complaining
>>>>>>>>> about connectivity to two bricks in one of my volumes:
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> [2021-05-19 14:05:14.462133 +0000] I
>>>>>>>>> [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-11: changing port
>>>>>>>>> to 49170 (from 0)
>>>>>>>>> [2021-05-19 14:05:14.464971 +0000] I [MSGID: 114057]
>>>>>>>>> [client-handshake.c:1128:select_server_supported_programs]
>>>>>>>>> 2-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1},
>>>>>>>>> {Num=1298437}, {Version=400}]
>>>>>>>>> [2021-05-19 14:05:14.465209 +0000] W [MSGID: 114043]
>>>>>>>>> [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-11:
>>>>>>>>> failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>>>>>> [2021-05-19 14:05:14.465236 +0000] W [MSGID: 114007]
>>>>>>>>> [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-11:
>>>>>>>>> failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid
>>>>>>>>> argument}]
>>>>>>>>> [2021-05-19 14:05:14.465248 +0000] E [MSGID: 114044]
>>>>>>>>> [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-11:
>>>>>>>>> SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2},
>>>>>>>>> {error=No such file or directory}]
>>>>>>>>> [2021-05-19 14:05:14.465256 +0000] I [MSGID: 114051]
>>>>>>>>> [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-11:
>>>>>>>>> sending CHILD_CONNECTING event []
>>>>>>>>> [2021-05-19 14:05:14.465291 +0000] I [MSGID: 114018]
>>>>>>>>> [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-11: disconnected
>>>>>>>>> from client, process will keep trying to connect glusterd until brick's
>>>>>>>>> port is available [{conn-name=VM_Storage_1-client-11}]
>>>>>>>>> [2021-05-19 14:05:14.473598 +0000] I
>>>>>>>>> [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-20: changing port
>>>>>>>>> to 49173 (from 0)
>>>>>>>>>
>>>>>>>>
>>>>>>>> The above logs indicate that shd is trying to connect to the bricks
>>>>>>>> on ports 49170 and 49173 respectively, when it should have done so
>>>>>>>> using 49172 and 49169 (as per the volume status and ps output). Shd gets
>>>>>>>> the brick port numbers info from glusterd, so I'm not sure what is going on
>>>>>>>> here.  Do you have fuse mounts on this particular node?  If you don't, you
>>>>>>>> can mount it temporarily, then check if the connection to the bricks is
>>>>>>>> successful from the .meta folder of the mount:
>>>>>>>>
>>>>>>>> cd /path-to-fuse-mount
>>>>>>>> cd .meta
>>>>>>>> cat graphs/active/VM_Storage_1-client-11/private
>>>>>>>> cat graphs/active/VM_Storage_1-client-20/private
>>>>>>>> etc. and check if connected=1 or 0.
>>>>>>>>
>>>>>>>> I just wanted to see if it is only the shd or even the other
>>>>>>>> clients are unable to connect to the bricks from this node. FWIW, I tried
>>>>>>>> upgrading from 7.9 to 8.4 on a test machine and the shd was able to connect
>>>>>>>> to the bricks just fine.
>>>>>>>> Regards,
>>>>>>>> Ravi
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> [2021-05-19 14:05:14.476543 +0000] I [MSGID: 114057]
>>>>>>>>> [client-handshake.c:1128:select_server_supported_programs]
>>>>>>>>> 2-VM_Storage_1-client-20: Using Program [{Program-name=GlusterFS 4.x v1},
>>>>>>>>> {Num=1298437}, {Version=400}]
>>>>>>>>> [2021-05-19 14:05:14.476764 +0000] W [MSGID: 114043]
>>>>>>>>> [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-20:
>>>>>>>>> failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>>>>>> [2021-05-19 14:05:14.476785 +0000] W [MSGID: 114007]
>>>>>>>>> [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-20:
>>>>>>>>> failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid
>>>>>>>>> argument}]
>>>>>>>>> [2021-05-19 14:05:14.476799 +0000] E [MSGID: 114044]
>>>>>>>>> [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-20:
>>>>>>>>> SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2},
>>>>>>>>> {error=No such file or directory}]
>>>>>>>>> [2021-05-19 14:05:14.476812 +0000] I [MSGID: 114051]
>>>>>>>>> [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-20:
>>>>>>>>> sending CHILD_CONNECTING event []
>>>>>>>>> [2021-05-19 14:05:14.476849 +0000] I [MSGID: 114018]
>>>>>>>>> [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-20: disconnected
>>>>>>>>> from client, process will keep trying to connect glusterd until brick's
>>>>>>>>> port is available [{conn-name=VM_Storage_1-client-20}]
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> The two bricks are the following:
>>>>>>>>> VM_Storage_1-client-20 --> Brick21:
>>>>>>>>> lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick (arbiter)
>>>>>>>>> VM_Storage_1-client-11 --> Brick12:
>>>>>>>>> lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick (arbiter)
>>>>>>>>> (In this case it the issue is on two arbiter nodes, but it is not
>>>>>>>>> always the case)
>>>>>>>>>
>>>>>>>>> The port information via "gluster volume status VM_Storage_1" on
>>>>>>>>> the affected node (same as the one running the glustershd reporting the
>>>>>>>>> issue) is:
>>>>>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick
>>>>>>>>>                     *49172     *0          Y       3978256
>>>>>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick
>>>>>>>>>                     *49169     *0          Y       3978224
>>>>>>>>>
>>>>>>>>> This is aligned to the actual port of the process:
>>>>>>>>> root     3978256  1.5  0.0 1999568 30372 ?       Ssl  May18  15:56
>>>>>>>>> /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id
>>>>>>>>> VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b5_arb-brick -p
>>>>>>>>> /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b5_arb-brick.pid
>>>>>>>>> -S /var/run/gluster/2b1dd3ca06d39a59.socket --brick-name
>>>>>>>>> /bricks/vm_b5_arb/brick -l
>>>>>>>>> /var/log/glusterfs/bricks/bricks-vm_b5_arb-brick.log --xlator-option
>>>>>>>>> *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name
>>>>>>>>> brick --brick-port *49172 *--xlator-option
>>>>>>>>> VM_Storage_1-server.listen-port=*49172*
>>>>>>>>> root     3978224  4.3  0.0 1867976 27928 ?       Ssl  May18  44:55
>>>>>>>>> /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id
>>>>>>>>> VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b3_arb-brick -p
>>>>>>>>> /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b3_arb-brick.pid
>>>>>>>>> -S /var/run/gluster/00d461b7d79badc9.socket --brick-name
>>>>>>>>> /bricks/vm_b3_arb/brick -l
>>>>>>>>> /var/log/glusterfs/bricks/bricks-vm_b3_arb-brick.log --xlator-option
>>>>>>>>> *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name
>>>>>>>>> brick --brick-port *49169 *--xlator-option
>>>>>>>>> VM_Storage_1-server.listen-port=*49169*
>>>>>>>>>
>>>>>>>>> So the issue seems to be specifically on glustershd, as the *glusterd
>>>>>>>>> process seems to be aware of the right port *(as it matches the
>>>>>>>>> real port, and the brick is indeed up according to the status).
>>>>>>>>>
>>>>>>>>> I have then requested a statedump as you have suggested, and the
>>>>>>>>> bricks seem to be not connected:
>>>>>>>>>
>>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>>>>>> *connected=0*
>>>>>>>>> total_bytes_read=341120
>>>>>>>>> ping_timeout=42
>>>>>>>>> total_bytes_written=594008
>>>>>>>>> ping_msgs_sent=0
>>>>>>>>> msgs_sent=0
>>>>>>>>>
>>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>>>>>>>>> *connected=0*
>>>>>>>>> total_bytes_read=341120
>>>>>>>>> ping_timeout=42
>>>>>>>>> total_bytes_written=594008
>>>>>>>>> ping_msgs_sent=0
>>>>>>>>> msgs_sent=0
>>>>>>>>>
>>>>>>>>> The important other thing to notice is that normally the bricks
>>>>>>>>> that are not connecting are always in the same (remote) node... i.e. they
>>>>>>>>> are both in node 3 in this case. That seems to be always the case, I have
>>>>>>>>> not encountered a scenario where bricks from different nodes are reporting
>>>>>>>>> this issue (at least for the same volume).
>>>>>>>>>
>>>>>>>>> Please let me know if you need any additional info.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Marco
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, 19 May 2021 at 06:31, Ravishankar N <
>>>>>>>>> ravishankar at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, May 17, 2021 at 4:22 PM Marco Fais <evilmf at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am having significant issues with glustershd with releases 8.4
>>>>>>>>>>> and 9.1.
>>>>>>>>>>>
>>>>>>>>>>> My oVirt clusters are using gluster storage backends, and were
>>>>>>>>>>> running fine with Gluster 7.x (shipped with earlier versions of oVirt Node
>>>>>>>>>>> 4.4.x). Recently the oVirt project moved to Gluster 8.4 for the nodes, and
>>>>>>>>>>> hence I have moved to this release when upgrading my clusters.
>>>>>>>>>>>
>>>>>>>>>>> Since then I am having issues whenever one of the nodes is
>>>>>>>>>>> brought down; when the nodes come back up online the bricks are typically
>>>>>>>>>>> back up and working, but some (random) glustershd processes in the various
>>>>>>>>>>> nodes seem to have issues connecting to some of them.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> When the issue happens, can you check if the TCP port number of
>>>>>>>>>> the brick (glusterfsd) processes displayed in `gluster volume status`
>>>>>>>>>> matches with that of the actual port numbers observed (i.e. the
>>>>>>>>>> --brick-port argument) when you run `ps aux | grep glusterfsd` ? If they
>>>>>>>>>> don't match, then glusterd has incorrect brick port information in its
>>>>>>>>>> memory and serving it to glustershd. Restarting glusterd instead of
>>>>>>>>>> (killing the bricks + `volume start force`) should fix it, although we need
>>>>>>>>>> to find why glusterd serves incorrect port numbers.
>>>>>>>>>>
>>>>>>>>>> If they do match, then can you take a statedump of glustershd to
>>>>>>>>>> check that it is indeed disconnected from the bricks? You will
>>>>>>>>>> need to verify that 'connected=1' in the statedump. See "Self-heal is
>>>>>>>>>> stuck/ not getting completed." section in
>>>>>>>>>> https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/.
>>>>>>>>>> Statedump can be taken by `kill -SIGUSR1 $pid-of-glustershd`. It will be
>>>>>>>>>> generated in the /var/run/gluster/ directory.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Ravi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>
>> --
>> Regards,
>> Srijan
>>
>

-- 
Regards,
Srijan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20210527/919f10d5/attachment.html>