[Gluster-users] Issues with glustershd with release 8.4 and 9.1

Marco Fais evilmf at gmail.com
Mon May 31 10:55:59 UTC 2021


Srijan,

no problem at all -- thanks for your help. If you need any additional
information please let me know.

Regards,
Marco


On Thu, 27 May 2021 at 18:39, Srijan Sivakumar <ssivakum at redhat.com> wrote:

> Hi Marco,
>
> Thank you for opening the issue. I'll check the log contents and get back
> to you.
>
> On Thu, May 27, 2021 at 10:50 PM Marco Fais <evilmf at gmail.com> wrote:
>
>> Srijan
>>
>> thanks a million -- I have opened the issue as requested here:
>>
>> https://github.com/gluster/glusterfs/issues/2492
>>
>> I have attached the glusterd.log and glustershd.log files, but please let
>> me know if there is any other test I should do or logs I should provide.
>>
>>
>> Thanks,
>> Marco
>>
>>
>> On Wed, 26 May 2021 at 18:09, Srijan Sivakumar <ssivakum at redhat.com>
>> wrote:
>>
>>> Hi Marco,
>>>
>>> If possible, let's open an issue on GitHub and track this from there. I
>>> am checking the previous mails in the chain to see if I can infer something
>>> about the situation. It would be helpful if we could analyze this with the
>>> help of the log files, especially glusterd.log and glustershd.log.
>>>
>>> To open an issue, you can use this link: Open a new issue
>>> <https://github.com/gluster/glusterfs/issues/new>
>>>
>>> On Wed, May 26, 2021 at 5:02 PM Marco Fais <evilmf at gmail.com> wrote:
>>>
>>>> Ravi,
>>>>
>>>> thanks a million.
>>>> @Mohit, @Srijan please let me know if you need any additional
>>>> information.
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>>
>>>> On Tue, 25 May 2021 at 17:28, Ravishankar N <ravishankar at redhat.com>
>>>> wrote:
>>>>
>>>>> Hi Marco,
>>>>> I haven't had any luck yet. Adding Mohit and Srijan, who work on
>>>>> glusterd, in case they have any input.
>>>>> -Ravi
>>>>>
>>>>>
>>>>> On Tue, May 25, 2021 at 9:31 PM Marco Fais <evilmf at gmail.com> wrote:
>>>>>
>>>>>> Hi Ravi
>>>>>>
>>>>>> just wondering if you have any further thoughts on this --
>>>>>> unfortunately it is still very much affecting us at the moment.
>>>>>> I am trying to understand how to troubleshoot it further but haven't
>>>>>> been able to make much progress...
>>>>>>
>>>>>> Thanks,
>>>>>> Marco
>>>>>>
>>>>>>
>>>>>> On Thu, 20 May 2021 at 19:04, Marco Fais <evilmf at gmail.com> wrote:
>>>>>>
>>>>>>> Just to complete the picture...
>>>>>>>
>>>>>>> from the FUSE mount log on server 2 I see the same errors as in
>>>>>>> glustershd.log on node 1:
>>>>>>>
>>>>>>> [2021-05-20 17:58:34.157971 +0000] I [MSGID: 114020]
>>>>>>> [client.c:2319:notify] 0-VM_Storage_1-client-11: parent translators are
>>>>>>> ready, attempting connect on transport []
>>>>>>> [2021-05-20 17:58:34.160586 +0000] I
>>>>>>> [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-11: changing port
>>>>>>> to 49170 (from 0)
>>>>>>> [2021-05-20 17:58:34.160608 +0000] I
>>>>>>> [socket.c:849:__socket_shutdown] 0-VM_Storage_1-client-11: intentional
>>>>>>> socket shutdown(20)
>>>>>>> [2021-05-20 17:58:34.161403 +0000] I [MSGID: 114046]
>>>>>>> [client-handshake.c:857:client_setvolume_cbk] 0-VM_Storage_1-client-10:
>>>>>>> Connected, attached to remote volume [{conn-name=VM_Storage_1-client-10},
>>>>>>> {remote_subvol=/bricks/vm_b3_vol/brick}]
>>>>>>> [2021-05-20 17:58:34.161513 +0000] I [MSGID: 108002]
>>>>>>> [afr-common.c:6435:afr_notify] 0-VM_Storage_1-replicate-3: Client-quorum is
>>>>>>> met
>>>>>>> [2021-05-20 17:58:34.162043 +0000] I [MSGID: 114020]
>>>>>>> [client.c:2319:notify] 0-VM_Storage_1-client-13: parent translators are
>>>>>>> ready, attempting connect on transport []
>>>>>>> [2021-05-20 17:58:34.162491 +0000] I
>>>>>>> [rpc-clnt.c:1968:rpc_clnt_reconfig] 0-VM_Storage_1-client-12: changing port
>>>>>>> to 49170 (from 0)
>>>>>>> [2021-05-20 17:58:34.162507 +0000] I
>>>>>>> [socket.c:849:__socket_shutdown] 0-VM_Storage_1-client-12: intentional
>>>>>>> socket shutdown(26)
>>>>>>> [2021-05-20 17:58:34.163076 +0000] I [MSGID: 114057]
>>>>>>> [client-handshake.c:1128:select_server_supported_programs]
>>>>>>> 0-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1},
>>>>>>> {Num=1298437}, {Version=400}]
>>>>>>> [2021-05-20 17:58:34.163339 +0000] W [MSGID: 114043]
>>>>>>> [client-handshake.c:727:client_setvolume_cbk] 0-VM_Storage_1-client-11:
>>>>>>> failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>>>> [2021-05-20 17:58:34.163351 +0000] W [MSGID: 114007]
>>>>>>> [client-handshake.c:752:client_setvolume_cbk] 0-VM_Storage_1-client-11:
>>>>>>> failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid
>>>>>>> argument}]
>>>>>>> [2021-05-20 17:58:34.163360 +0000] E [MSGID: 114044]
>>>>>>> [client-handshake.c:757:client_setvolume_cbk] 0-VM_Storage_1-client-11:
>>>>>>> SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2},
>>>>>>> {error=No such file or directory}]
>>>>>>> [2021-05-20 17:58:34.163365 +0000] I [MSGID: 114051]
>>>>>>> [client-handshake.c:879:client_setvolume_cbk] 0-VM_Storage_1-client-11:
>>>>>>> sending CHILD_CONNECTING event []
>>>>>>> [2021-05-20 17:58:34.163425 +0000] I [MSGID: 114018]
>>>>>>> [client.c:2229:client_rpc_notify] 0-VM_Storage_1-client-11: disconnected
>>>>>>> from client, process will keep trying to connect glusterd until brick's
>>>>>>> port is available [{conn-name=VM_Storage_1-client-11}]
>>>>>>>
>>>>>>> On Thu, 20 May 2021 at 18:54, Marco Fais <evilmf at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Ravi,
>>>>>>>>
>>>>>>>> thanks again for your help.
>>>>>>>>
>>>>>>>> Here is the output of "cat
>>>>>>>> graphs/active/VM_Storage_1-client-11/private" from the same node
>>>>>>>> where glustershd is complaining:
>>>>>>>>
>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>>>>> fd.0.remote_fd = 1
>>>>>>>> ------ = ------
>>>>>>>> granted-posix-lock[0] = owner = 7904e87d91693fb7, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 100, l_len = 1
>>>>>>>> granted-posix-lock[1] = owner = 7904e87d91693fb7, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 101, l_len = 1
>>>>>>>> granted-posix-lock[2] = owner = 7904e87d91693fb7, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 103, l_len = 1
>>>>>>>> granted-posix-lock[3] = owner = 7904e87d91693fb7, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 201, l_len = 1
>>>>>>>> granted-posix-lock[4] = owner = 7904e87d91693fb7, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 203, l_len = 1
>>>>>>>> ------ = ------
>>>>>>>> fd.1.remote_fd = 0
>>>>>>>> ------ = ------
>>>>>>>> granted-posix-lock[0] = owner = b43238094746d9fe, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 100, l_len = 1
>>>>>>>> granted-posix-lock[1] = owner = b43238094746d9fe, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 201, l_len = 1
>>>>>>>> granted-posix-lock[2] = owner = b43238094746d9fe, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 203, l_len = 1
>>>>>>>> ------ = ------
>>>>>>>> fd.2.remote_fd = 3
>>>>>>>> ------ = ------
>>>>>>>> granted-posix-lock[0] = owner = 53526588c515153b, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 100, l_len = 1
>>>>>>>> granted-posix-lock[1] = owner = 53526588c515153b, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 201, l_len = 1
>>>>>>>> granted-posix-lock[2] = owner = 53526588c515153b, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 203, l_len = 1
>>>>>>>> ------ = ------
>>>>>>>> fd.3.remote_fd = 2
>>>>>>>> ------ = ------
>>>>>>>> granted-posix-lock[0] = owner = 889461581e4fda22, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 100, fl_end = 100, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 100, l_len = 1
>>>>>>>> granted-posix-lock[1] = owner = 889461581e4fda22, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 101, fl_end = 101, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 101, l_len = 1
>>>>>>>> granted-posix-lock[2] = owner = 889461581e4fda22, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 103, fl_end = 103, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 103, l_len = 1
>>>>>>>> granted-posix-lock[3] = owner = 889461581e4fda22, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 201, fl_end = 201, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 201, l_len = 1
>>>>>>>> granted-posix-lock[4] = owner = 889461581e4fda22, cmd = F_SETLK
>>>>>>>> fl_type = F_RDLCK, fl_start = 203, fl_end = 203, user_flock: l_type =
>>>>>>>> F_RDLCK, l_start = 203, l_len = 1
>>>>>>>> ------ = ------
>>>>>>>> connected = 1
>>>>>>>> total_bytes_read = 6665235356
>>>>>>>> ping_timeout = 42
>>>>>>>> total_bytes_written = 4756303549
>>>>>>>> ping_msgs_sent = 3662
>>>>>>>> msgs_sent = 16786186
>>>>>>>>
>>>>>>>> So they seem to be connected there.
>>>>>>>> *However* -- apparently they are not connected on server 2 (where
>>>>>>>> I have just re-mounted the volume):
>>>>>>>> [root at lab-cnvirt-h02 .meta]# cat
>>>>>>>> graphs/active/VM_Storage_1-client-11/private
>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>>>>> *connected = 0*
>>>>>>>> total_bytes_read = 50020
>>>>>>>> ping_timeout = 42
>>>>>>>> total_bytes_written = 84628
>>>>>>>> ping_msgs_sent = 0
>>>>>>>> msgs_sent = 0
>>>>>>>> [root at lab-cnvirt-h02 .meta]# cat
>>>>>>>> graphs/active/VM_Storage_1-client-20/private
>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>>>>>>>> *connected = 0*
>>>>>>>> total_bytes_read = 53300
>>>>>>>> ping_timeout = 42
>>>>>>>> total_bytes_written = 90180
>>>>>>>> ping_msgs_sent = 0
>>>>>>>> msgs_sent = 0
>>>>>>>>
>>>>>>>> The other bricks look connected...
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Marco
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 20 May 2021 at 14:02, Ravishankar N <ravishankar at redhat.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Marco,
>>>>>>>>>
>>>>>>>>> On Wed, May 19, 2021 at 8:02 PM Marco Fais <evilmf at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ravi,
>>>>>>>>>>
>>>>>>>>>> thanks a million for your reply.
>>>>>>>>>>
>>>>>>>>>> I have replicated the issue in my test cluster by bringing one of
>>>>>>>>>> the nodes down, and then up again.
>>>>>>>>>> The glustershd process in the restarted node is now complaining
>>>>>>>>>> about connectivity to two bricks in one of my volumes:
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>> [2021-05-19 14:05:14.462133 +0000] I
>>>>>>>>>> [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-11: changing port
>>>>>>>>>> to 49170 (from 0)
>>>>>>>>>> [2021-05-19 14:05:14.464971 +0000] I [MSGID: 114057]
>>>>>>>>>> [client-handshake.c:1128:select_server_supported_programs]
>>>>>>>>>> 2-VM_Storage_1-client-11: Using Program [{Program-name=GlusterFS 4.x v1},
>>>>>>>>>> {Num=1298437}, {Version=400}]
>>>>>>>>>> [2021-05-19 14:05:14.465209 +0000] W [MSGID: 114043]
>>>>>>>>>> [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-11:
>>>>>>>>>> failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>>>>>>> [2021-05-19 14:05:14.465236 +0000] W [MSGID: 114007]
>>>>>>>>>> [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-11:
>>>>>>>>>> failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid
>>>>>>>>>> argument}]
>>>>>>>>>> [2021-05-19 14:05:14.465248 +0000] E [MSGID: 114044]
>>>>>>>>>> [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-11:
>>>>>>>>>> SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2},
>>>>>>>>>> {error=No such file or directory}]
>>>>>>>>>> [2021-05-19 14:05:14.465256 +0000] I [MSGID: 114051]
>>>>>>>>>> [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-11:
>>>>>>>>>> sending CHILD_CONNECTING event []
>>>>>>>>>> [2021-05-19 14:05:14.465291 +0000] I [MSGID: 114018]
>>>>>>>>>> [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-11: disconnected
>>>>>>>>>> from client, process will keep trying to connect glusterd until brick's
>>>>>>>>>> port is available [{conn-name=VM_Storage_1-client-11}]
>>>>>>>>>> [2021-05-19 14:05:14.473598 +0000] I
>>>>>>>>>> [rpc-clnt.c:1968:rpc_clnt_reconfig] 2-VM_Storage_1-client-20: changing port
>>>>>>>>>> to 49173 (from 0)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The above logs indicate that shd is trying to connect to the
>>>>>>>>> bricks on ports 49170 and 49173 respectively, when it should have
>>>>>>>>> done so using 49172 and 49169 (as per the volume status and ps output). Shd
>>>>>>>>> gets the brick port information from glusterd, so I'm not sure what is
>>>>>>>>> going on here. Do you have fuse mounts on this particular node? If you
>>>>>>>>> don't, you can mount the volume temporarily, then check whether the
>>>>>>>>> connection to the bricks is successful from the .meta folder of the mount:
>>>>>>>>>
>>>>>>>>> cd /path-to-fuse-mount
>>>>>>>>> cd .meta
>>>>>>>>> cat graphs/active/VM_Storage_1-client-11/private
>>>>>>>>> cat graphs/active/VM_Storage_1-client-20/private
>>>>>>>>> etc. and check if connected=1 or 0.
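>>>>>>>>>
>>>>>>>>> For example, a quick loop over all the clients of the volume could look
>>>>>>>>> roughly like this (just a sketch; the mount point /mnt/VM_Storage_1 below
>>>>>>>>> is an assumption, substitute your actual FUSE mount path):
>>>>>>>>>
>>>>>>>>> cd /mnt/VM_Storage_1/.meta/graphs/active
>>>>>>>>> for c in VM_Storage_1-client-*; do
>>>>>>>>>     printf '%s: ' "$c"; grep '^connected' "$c/private"
>>>>>>>>> done
>>>>>>>>>
>>>>>>>>> Any client printing "connected = 0" is not attached to its brick.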
>>>>>>>>>
>>>>>>>>> I just wanted to see whether it is only the shd, or whether the other
>>>>>>>>> clients are also unable to connect to the bricks from this node. FWIW, I tried
>>>>>>>>> upgrading from 7.9 to 8.4 on a test machine and the shd was able to connect
>>>>>>>>> to the bricks just fine.
>>>>>>>>> Regards,
>>>>>>>>> Ravi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> [2021-05-19 14:05:14.476543 +0000] I [MSGID: 114057]
>>>>>>>>>> [client-handshake.c:1128:select_server_supported_programs]
>>>>>>>>>> 2-VM_Storage_1-client-20: Using Program [{Program-name=GlusterFS 4.x v1},
>>>>>>>>>> {Num=1298437}, {Version=400}]
>>>>>>>>>> [2021-05-19 14:05:14.476764 +0000] W [MSGID: 114043]
>>>>>>>>>> [client-handshake.c:727:client_setvolume_cbk] 2-VM_Storage_1-client-20:
>>>>>>>>>> failed to set the volume [{errno=2}, {error=No such file or directory}]
>>>>>>>>>> [2021-05-19 14:05:14.476785 +0000] W [MSGID: 114007]
>>>>>>>>>> [client-handshake.c:752:client_setvolume_cbk] 2-VM_Storage_1-client-20:
>>>>>>>>>> failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid
>>>>>>>>>> argument}]
>>>>>>>>>> [2021-05-19 14:05:14.476799 +0000] E [MSGID: 114044]
>>>>>>>>>> [client-handshake.c:757:client_setvolume_cbk] 2-VM_Storage_1-client-20:
>>>>>>>>>> SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2},
>>>>>>>>>> {error=No such file or directory}]
>>>>>>>>>> [2021-05-19 14:05:14.476812 +0000] I [MSGID: 114051]
>>>>>>>>>> [client-handshake.c:879:client_setvolume_cbk] 2-VM_Storage_1-client-20:
>>>>>>>>>> sending CHILD_CONNECTING event []
>>>>>>>>>> [2021-05-19 14:05:14.476849 +0000] I [MSGID: 114018]
>>>>>>>>>> [client.c:2229:client_rpc_notify] 2-VM_Storage_1-client-20: disconnected
>>>>>>>>>> from client, process will keep trying to connect glusterd until brick's
>>>>>>>>>> port is available [{conn-name=VM_Storage_1-client-20}]
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> The two bricks are the following:
>>>>>>>>>> VM_Storage_1-client-20 --> Brick21:
>>>>>>>>>> lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick (arbiter)
>>>>>>>>>> VM_Storage_1-client-11 --> Brick12:
>>>>>>>>>> lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick (arbiter)
>>>>>>>>>> (In this case the issue is on two arbiter bricks, but it is not
>>>>>>>>>> always the case)
>>>>>>>>>>
>>>>>>>>>> The port information via "gluster volume status VM_Storage_1" on
>>>>>>>>>> the affected node (same as the one running the glustershd reporting the
>>>>>>>>>> issue) is:
>>>>>>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b5_arb/brick
>>>>>>>>>>                       *49172*     0          Y       3978256
>>>>>>>>>> Brick lab-cnvirt-h03-storage:/bricks/vm_b3_arb/brick
>>>>>>>>>>                       *49169*     0          Y       3978224
>>>>>>>>>>
>>>>>>>>>> This is aligned with the actual ports of the processes:
>>>>>>>>>> root     3978256  1.5  0.0 1999568 30372 ?       Ssl  May18
>>>>>>>>>>  15:56 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id
>>>>>>>>>> VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b5_arb-brick -p
>>>>>>>>>> /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b5_arb-brick.pid
>>>>>>>>>> -S /var/run/gluster/2b1dd3ca06d39a59.socket --brick-name
>>>>>>>>>> /bricks/vm_b5_arb/brick -l
>>>>>>>>>> /var/log/glusterfs/bricks/bricks-vm_b5_arb-brick.log --xlator-option
>>>>>>>>>> *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name
>>>>>>>>>> brick --brick-port *49172* --xlator-option
>>>>>>>>>> VM_Storage_1-server.listen-port=*49172*
>>>>>>>>>> root     3978224  4.3  0.0 1867976 27928 ?       Ssl  May18
>>>>>>>>>>  44:55 /usr/sbin/glusterfsd -s lab-cnvirt-h03-storage --volfile-id
>>>>>>>>>> VM_Storage_1.lab-cnvirt-h03-storage.bricks-vm_b3_arb-brick -p
>>>>>>>>>> /var/run/gluster/vols/VM_Storage_1/lab-cnvirt-h03-storage-bricks-vm_b3_arb-brick.pid
>>>>>>>>>> -S /var/run/gluster/00d461b7d79badc9.socket --brick-name
>>>>>>>>>> /bricks/vm_b3_arb/brick -l
>>>>>>>>>> /var/log/glusterfs/bricks/bricks-vm_b3_arb-brick.log --xlator-option
>>>>>>>>>> *-posix.glusterd-uuid=a2a62dd6-49b2-4eb6-a7e2-59c75723f5c7 --process-name
>>>>>>>>>> brick --brick-port *49169* --xlator-option
>>>>>>>>>> VM_Storage_1-server.listen-port=*49169*
>>>>>>>>>>
>>>>>>>>>> So the issue seems to be specific to glustershd, as the *glusterd
>>>>>>>>>> process seems to be aware of the right port* (it matches the
>>>>>>>>>> real port, and the brick is indeed up according to the status).
>>>>>>>>>>
>>>>>>>>>> I have then taken a statedump as you suggested, and the
>>>>>>>>>> bricks do not seem to be connected:
>>>>>>>>>>
>>>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-11.priv]
>>>>>>>>>> *connected=0*
>>>>>>>>>> total_bytes_read=341120
>>>>>>>>>> ping_timeout=42
>>>>>>>>>> total_bytes_written=594008
>>>>>>>>>> ping_msgs_sent=0
>>>>>>>>>> msgs_sent=0
>>>>>>>>>>
>>>>>>>>>> [xlator.protocol.client.VM_Storage_1-client-20.priv]
>>>>>>>>>> *connected=0*
>>>>>>>>>> total_bytes_read=341120
>>>>>>>>>> ping_timeout=42
>>>>>>>>>> total_bytes_written=594008
>>>>>>>>>> ping_msgs_sent=0
>>>>>>>>>> msgs_sent=0
>>>>>>>>>>
>>>>>>>>>> The other important thing to notice is that the bricks that are not
>>>>>>>>>> connecting are always on the same (remote) node... i.e. they are both on
>>>>>>>>>> node 3 in this case. That always seems to be the case; I have not
>>>>>>>>>> encountered a scenario where bricks from different nodes report this
>>>>>>>>>> issue (at least for the same volume).
>>>>>>>>>>
>>>>>>>>>> Please let me know if you need any additional info.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Marco
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, 19 May 2021 at 06:31, Ravishankar N <
>>>>>>>>>> ravishankar at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 17, 2021 at 4:22 PM Marco Fais <evilmf at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I am having significant issues with glustershd with releases
>>>>>>>>>>>> 8.4 and 9.1.
>>>>>>>>>>>>
>>>>>>>>>>>> My oVirt clusters are using gluster storage backends, and were
>>>>>>>>>>>> running fine with Gluster 7.x (shipped with earlier versions of oVirt Node
>>>>>>>>>>>> 4.4.x). Recently the oVirt project moved to Gluster 8.4 for the nodes, and
>>>>>>>>>>>> hence I have moved to this release when upgrading my clusters.
>>>>>>>>>>>>
>>>>>>>>>>>> Since then I have been having issues whenever one of the nodes is
>>>>>>>>>>>> brought down; when the node comes back online the bricks are typically
>>>>>>>>>>>> back up and working, but some (random) glustershd processes on the various
>>>>>>>>>>>> nodes seem to have issues connecting to some of the bricks.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> When the issue happens, can you check whether the TCP port numbers of
>>>>>>>>>>> the brick (glusterfsd) processes displayed in `gluster volume status`
>>>>>>>>>>> match the actual port numbers observed (i.e. the
>>>>>>>>>>> --brick-port argument) when you run `ps aux | grep glusterfsd`? If they
>>>>>>>>>>> don't match, then glusterd has incorrect brick port information in its
>>>>>>>>>>> memory and is serving it to glustershd. Restarting glusterd (instead of
>>>>>>>>>>> killing the bricks + `volume start force`) should fix it, although we need
>>>>>>>>>>> to find out why glusterd serves incorrect port numbers.
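>>>>>>>>>>>
>>>>>>>>>>> For example, something along these lines (a rough sketch; the volume name
>>>>>>>>>>> VM_Storage_1 is taken from your earlier output, adjust as needed):
>>>>>>>>>>>
>>>>>>>>>>> # ports according to glusterd:
>>>>>>>>>>> gluster volume status VM_Storage_1 | grep ^Brick
>>>>>>>>>>> # ports the brick processes were actually started with:
>>>>>>>>>>> ps aux | grep '[g]lusterfsd' | grep -o -- '--brick-port [0-9]*'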
>>>>>>>>>>>
>>>>>>>>>>> If they do match, then can you take a statedump of glustershd
>>>>>>>>>>> to check whether it is indeed disconnected from the bricks? You
>>>>>>>>>>> will need to check whether 'connected=1' appears in the statedump. See "Self-heal is
>>>>>>>>>>> stuck/ not getting completed." section in
>>>>>>>>>>> https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/.
>>>>>>>>>>> Statedump can be taken by `kill -SIGUSR1 $pid-of-glustershd`. It will be
>>>>>>>>>>> generated in the /var/run/gluster/ directory.
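>>>>>>>>>>>
>>>>>>>>>>> For example, roughly (a sketch only; the glusterdump.<pid>.dump.* file
>>>>>>>>>>> name is the usual default and may differ on your setup):
>>>>>>>>>>>
>>>>>>>>>>> pid=$(pgrep -f glustershd)
>>>>>>>>>>> kill -SIGUSR1 "$pid"
>>>>>>>>>>> sleep 2
>>>>>>>>>>> grep -E '^\[xlator\.protocol\.client|^connected' \
>>>>>>>>>>>     /var/run/gluster/glusterdump.$pid.dump.*
>>>>>>>>>>>
>>>>>>>>>>> This prints each client xlator section together with its connected=0/1 line.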
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Ravi
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>
>>> --
>>> Regards,
>>> Srijan
>>>
>>
>
> --
> Regards,
> Srijan
>