[Gluster-users] random disconnects of peers

Thu Aug 18 07:54:50 UTC 2022

What if you renice the gluster processes to some negative value?
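
A minimal sketch of that suggestion, assuming the daemons run under their usual process names (glusterd, glusterfsd, glusterfs). The function name, the default of -10 and the DRY_RUN switch are illustrative choices, not something from the thread. Whether renice treats the value as absolute or relative can depend on the implementation, so check `man renice` on the node first.

```shell
# Sketch only: renice all Gluster daemons found by exact process name.
# The default of -10 and the DRY_RUN switch are illustrative assumptions.
renice_gluster() {
    nice_value="${1:--10}"
    for name in glusterd glusterfsd glusterfs; do
        # pgrep -x matches the process name exactly; no match prints nothing.
        for pid in $(pgrep -x "$name"); do
            if [ "${DRY_RUN:-0}" = "1" ]; then
                echo "renice -n $nice_value -p $pid"
            else
                renice -n "$nice_value" -p "$pid"
            fi
        done
    done
}

# Usage (as root): preview first, then apply.
#   DRY_RUN=1 renice_gluster -10
#   renice_gluster -10
```

The new priority only applies to the processes running at that moment. A restarted brick or glusterd comes back with its default priority, so this is a quick experiment rather than a permanent setting.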

<dpgluster at posteo.de> wrote on Thu, Aug 18, 2022 at 09:45:

> Hi folks,
>
> I am running multiple GlusterFS servers in multiple datacenters. Every
> datacenter has basically the same setup: 3x storage nodes, 3x KVM
> hypervisors (oVirt) and 2x HPE switches which act as one logical
> unit. The NICs of all servers are attached to both switches with a
> bond of two NICs, in case one of the switches has a major problem.
> In one datacenter I have had strange problems with GlusterFS for nearly
> half a year now and I'm not able to figure out the root cause.
>
> Environment
> - glusterfs 9.5 running on CentOS 7.9.2009 (Core)
> - three gluster volumes, all options equally configured
>
> root@storage-001# gluster volume info
> Volume Name: g-volume-domain
> Type: Replicate
> Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain
> Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain
> Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain
> Options Reconfigured:
> client.event-threads: 4
> performance.cache-size: 1GB
> server.event-threads: 4
> server.allow-insecure: On
> network.ping-timeout: 42
> performance.client-io-threads: off
> nfs.disable: on
> transport.address-family: inet
> cluster.quorum-type: auto
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.quick-read: off
> cluster.data-self-heal-algorithm: diff
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.readdir-ahead: on
> performance.read-ahead: off
> client.ssl: off
> server.ssl: off
> auth.ssl-allow: storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain
> ssl.cipher-list: HIGH:!SSLv2
> cluster.shd-max-threads: 4
> diagnostics.latency-measurement: on
> diagnostics.count-fop-hits: on
> performance.io-thread-count: 32
>
> Problem
> The glusterd on one storage node seems to lose its connection to
> another storage node. When the problem occurs, the first message in
> /var/log/glusterfs/glusterd.log is always the following (variable values
> are replaced with "x"):
> [2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-00x.my.domain> (<xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx>), in state <Peer in Cluster>, has disconnected from glusterd.
>
> I will post a filtered log for this specific error on each of my storage
> nodes below.
> storage-001:
> root@storage-001# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
> [2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
> [2022-08-16 05:34:47.721060 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>), in state <Peer in Cluster>, has disconnected from glusterd.
> [2022-08-16 06:01:22.472973 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
> root@storage-001#
>
> storage-002:
> root@storage-002# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
> [2022-08-16 05:01:34.502322 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>), in state <Peer in Cluster>, has disconnected from glusterd.
> [2022-08-16 05:19:16.898406 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
> [2022-08-16 06:01:22.462676 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
> [2022-08-16 10:17:52.154501 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in state <Peer in Cluster>, has disconnected from glusterd.
> root@storage-002#
>
> storage-003:
> root@storage-003# tail -n 100000 /var/log/glusterfs/glusterd.log | grep "has disconnected from" | grep "2022-08-16"
> [2022-08-16 05:24:18.225432 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
> [2022-08-16 05:27:22.683234 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
> [2022-08-16 10:17:50.624775 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in state <Peer in Cluster>, has disconnected from glusterd.
> root@storage-003#
>
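
The three filtered logs above are easier to compare as one timeline. For example, storage-001 and storage-002 each report the other as disconnected at 06:01:22. The following is a rough sketch that reduces each MSGID 106004 line to "timestamp peer", assuming the log file holds one event per line (unlike the wrapped mail text). The function name `disconnect_events` is made up for illustration.

```shell
# Sketch: print "<date> <time> <peer>" for every MSGID 106004 disconnect
# line read from stdin. Lines that do not match are silently dropped.
disconnect_events() {
    sed -n 's/^\[\([0-9-]* [0-9:.]*\) [^]]*\] I \[MSGID: 106004\].*Peer <\([^>]*\)>.*has disconnected from glusterd.*$/\1 \2/p'
}

# Usage on one node:
#   disconnect_events < /var/log/glusterfs/glusterd.log
# To merge several nodes, append the reporting node's name to each line
# and pipe the combined output through sort.
```

Sorting the combined output by timestamp shows which disconnects were seen from both sides at the same moment and which were only noticed by one node.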
> After this message it takes a couple of seconds (in the specific example of
> 2022-08-16 it's one to four seconds) and the disconnected node is
> reachable again:
> [2022-08-16 05:01:32.110518 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:474:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6, host: storage-002.my.domain, port: 0
>
> This behavior is the same on all nodes: a gluster node disconnects
> and a couple of seconds later the disconnected node is
> reachable again. After the reconnect, glustershd is invoked and heals
> all the data. How can I figure out the root cause of these random
> disconnects?
>
> My debugging actions so far:
> - checked dmesg -> zero messages around the time of the disconnects
> - checked the switch -> no port down/up, no packet errors
> - disabled SSL on the gluster volumes -> disconnects still occur
> - checked the dropped/error packet counters on the network interfaces of
> the storage nodes -> no dropped packets, no errors
> - constant pingcheck between all nodes while a disconnect occurs ->
> zero packet loss, no high latencies
> - temporarily deactivated one of the two interfaces that form the
> bond -> disconnects still occur
> - updated gluster from 6.x to 9.5 -> disconnects still occur
>
> Important info: I can force this error to happen if I put a high
> I/O load on one of the gluster volumes.
>
> I suspect there could be an issue with a network queue overflow or
> something like that, but that theory does not match the result of my
> pingcheck.
>
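
An ICMP ping check does not pass through the TCP accept queues, so zero ping loss cannot rule out drops at the TCP level. One way to look at that side is to read a few TcpExt counters from /proc/net/netstat before and after forcing the I/O load. This is a sketch under assumptions: the function name `tcp_counters` is hypothetical, and the choice of ListenOverflows, ListenDrops and TCPTimeouts is only a starting point, not a confirmed diagnosis.

```shell
# Sketch: print selected TcpExt counters as "<name> <value>".
# Reads /proc/net/netstat by default; a file path can be passed for testing.
tcp_counters() {
    awk '
        # First TcpExt line holds the counter names.
        $1 == "TcpExt:" && !have_names {
            for (i = 2; i <= NF; i++) name[i] = $i
            have_names = 1
            next
        }
        # Second TcpExt line holds the values in the same column order.
        $1 == "TcpExt:" {
            for (i = 2; i <= NF; i++)
                if (name[i] ~ /^(ListenOverflows|ListenDrops|TCPTimeouts)$/)
                    print name[i], $i
        }
    ' "${1:-/proc/net/netstat}"
}

# Usage: run once before the load test and once right after a disconnect,
# then compare the two outputs.
#   tcp_counters
```

The counters are cumulative since boot, so only the difference between two samples is meaningful. A jump that coincides with a disconnect would support the queue overflow theory; no change would point elsewhere.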
>
> What would be your next step to debug this error?
>
>
> Thanks in advance!