<div dir="auto">What if you renice the gluster processes to some negative value?</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr"> <<a href="mailto:dpgluster@posteo.de">dpgluster@posteo.de</a>> wrote on Thu, Aug 18, 2022 at 09:45:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi folks,<br>

<br>

I am running multiple GlusterFS servers in multiple datacenters. Every <br>

datacenter is basically the same setup: 3x storage nodes, 3x kvm <br>

hypervisors (oVirt) and 2x HPE switches which are acting as one logical <br>

unit. The NICs of all servers are attached to both switches with a <br>

bonding of two NICs, in case one of the switches has a major problem.<br>

In one datacenter I have had strange problems with GlusterFS for nearly <br>

half a year now, and I'm not able to figure out the root cause.<br>

<br>

Environment<br>

- glusterfs 9.5 running on CentOS 7.9.2009 (Core)<br>

- three gluster volumes, all options equally configured<br>

<br>

root@storage-001# gluster volume info<br>

Volume Name: g-volume-domain<br>

Type: Replicate<br>

Volume ID: ffd3baa5-6125-48da-a5a4-5ee3969cfbd0<br>

Status: Started<br>

Snapshot Count: 0<br>

Number of Bricks: 1 x 3 = 3<br>

Transport-type: tcp<br>

Bricks:<br>

Brick1: storage-003.my.domain:/mnt/bricks/g-volume-domain<br>

Brick2: storage-002.my.domain:/mnt/bricks/g-volume-domain<br>

Brick3: storage-001.my.domain:/mnt/bricks/g-volume-domain<br>

Options Reconfigured:<br>

client.event-threads: 4<br>

performance.cache-size: 1GB<br>

server.event-threads: 4<br>

server.allow-insecure: On<br>

network.ping-timeout: 42<br>

performance.client-io-threads: off<br>

nfs.disable: on<br>

transport.address-family: inet<br>

cluster.quorum-type: auto<br>

network.remote-dio: enable<br>

cluster.eager-lock: enable<br>

performance.stat-prefetch: off<br>

performance.io-cache: off<br>

performance.quick-read: off<br>

cluster.data-self-heal-algorithm: diff<br>

storage.owner-uid: 36<br>

storage.owner-gid: 36<br>

performance.readdir-ahead: on<br>

performance.read-ahead: off<br>

client.ssl: off<br>

server.ssl: off<br>

auth.ssl-allow: <br>

storage-001.my.domain,storage-002.my.domain,storage-003.my.domain,hv-001.my.domain,hv-002.my.domain,hv-003.my.domain<br>

ssl.cipher-list: HIGH:!SSLv2<br>

cluster.shd-max-threads: 4<br>

diagnostics.latency-measurement: on<br>

diagnostics.count-fop-hits: on<br>

performance.io-thread-count: 32<br>

<br>

Problem<br>

The glusterd on one storage node seems to lose its connection to <br>

another storage node. When the problem occurs, the first message in <br>

/var/log/glusterfs/glusterd.log is always the following (variable values <br>

are replaced with "x"):<br>

[2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] <br>

[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <br>

<storage-00x.my.domain> (<xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx>), in <br>

state <Peer in Cluster>, has disconnected from glusterd.<br>

<br>

I will post a filtered log for this specific error on each of my storage <br>

nodes below.<br>

storage-001:<br>

root@storage-001# tail -n 100000 /var/log/glusterfs/glusterd.log | grep <br>

"has disconnected from" | grep "2022-08-16"<br>

[2022-08-16 05:01:28.615441 +0000] I [MSGID: 106004] <br>

[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <br>

<storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in <br>

state <Peer in Cluster>, has disconnected from glusterd.<br>

[2022-08-16 05:34:47.721060 +0000] I [MSGID: 106004] <br>

[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <br>

<storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>), in <br>

state <Peer in Cluster>, has disconnected from glusterd.<br>

[2022-08-16 06:01:22.472973 +0000] I [MSGID: 106004] <br>

[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <br>

<storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in <br>

state <Peer in Cluster>, has disconnected from glusterd.<br>

root@storage-001#<br>

<br>

storage-002:<br>

root@storage-002# tail -n 100000 /var/log/glusterfs/glusterd.log | grep <br>

"has disconnected from" | grep "2022-08-16"<br>

[2022-08-16 05:01:34.502322 +0000] I [MSGID: 106004] <br>

[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <br>

<storage-003.my.domain> (<a911feef-14c7-4740-a7ae-1d475a724c9f>), in <br>

state <Peer in Cluster>, has disconnected from glusterd.<br>

[2022-08-16 05:19:16.898406 +0000] I [MSGID: 106004] <br>

[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <br>

<storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in <br>

state <Peer in Cluster>, has disconnected from glusterd.<br>

[2022-08-16 06:01:22.462676 +0000] I [MSGID: 106004] <br>

[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <br>

<storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in <br>

state <Peer in Cluster>, has disconnected from glusterd.<br>

[2022-08-16 10:17:52.154501 +0000] I [MSGID: 106004] <br>

[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <br>

<storage-001.my.domain> (<c3e3941e-bb07-460e-8aea-03b17e2ddaff>), in <br>

state <Peer in Cluster>, has disconnected from glusterd.<br>

root@storage-002#<br>

<br>

storage-003:<br>

root@storage-003# tail -n 100000 /var/log/glusterfs/glusterd.log | grep <br>

"has disconnected from" | grep "2022-08-16"<br>

[2022-08-16 05:24:18.225432 +0000] I [MSGID: 106004] <br>

[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <br>

<storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in <br>

state <Peer in Cluster>, has disconnected from glusterd.<br>

[2022-08-16 05:27:22.683234 +0000] I [MSGID: 106004] <br>

[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <br>

<storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in <br>

state <Peer in Cluster>, has disconnected from glusterd.<br>

[2022-08-16 10:17:50.624775 +0000] I [MSGID: 106004] <br>

[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <br>

<storage-002.my.domain> (<8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6>), in <br>

state <Peer in Cluster>, has disconnected from glusterd.<br>

root@storage-003#<br>

<br>

After this message it takes a couple of seconds (in the specific example of <br>

2022-08-16 it's one to four seconds) until the disconnected node is <br>

reachable again:<br>

[2022-08-16 05:01:32.110518 +0000] I [MSGID: 106493] <br>

[glusterd-rpc-ops.c:474:__glusterd_friend_add_cbk] 0-glusterd: Received <br>

ACC from uuid: 8bb466f6-01d6-42f2-ba75-b7a1eebc5ac6, host: <br>

storage-002.my.domain, port: 0<br>

<br>
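A rough way to measure how long each outage lasts is to pair every <br>

disconnect message with the next "Received ACC" message for the same host. <br>

The sketch below assumes each log entry is a single line in glusterd.log <br>

(the line breaks above come from the mail client) and that the ACC message <br>

marks the reconnect. The function names are hypothetical:<br>

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the glusterd.log lines quoted above.
TS = r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) \+0000\]"
DISCONNECT_RE = re.compile(
    TS + r".*MSGID: 106004.*Peer <([^>]+)>.*has disconnected from glusterd"
)
RECONNECT_RE = re.compile(
    TS + r".*MSGID: 106493.*Received ACC from uuid: \S+, host: ([^,\s]+)"
)


def parse_ts(text):
    """Parse a glusterd.log timestamp (logged in UTC) into a datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def outage_durations(lines):
    """Return (host, disconnect_time, seconds_until_ACC) for each outage."""
    pending = {}   # host -> time of the first unmatched disconnect
    outages = []
    for line in lines:
        m = DISCONNECT_RE.search(line)
        if m:
            pending.setdefault(m.group(2), parse_ts(m.group(1)))
            continue
        m = RECONNECT_RE.search(line)
        if m and m.group(2) in pending:
            start = pending.pop(m.group(2))
            seconds = (parse_ts(m.group(1)) - start).total_seconds()
            outages.append((m.group(2), start, seconds))
    return outages
```

Feeding it the lines of /var/log/glusterfs/glusterd.log from each node gives <br>

one duration per outage, which can then be compared across the three nodes.<br>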

This behavior is the same on all nodes: a gluster node disconnects <br>

and a couple of seconds later the disconnected node is <br>

reachable again. After the reconnect, glustershd is invoked and heals <br>

all the data. How can I figure out the root cause of these random <br>

disconnects?<br>

<br>

My debugging actions so far:<br>

- checked dmesg -> zero messages around the time of the disconnects<br>

- checked the switch -> no port down/up, no packet errors<br>

- disabled ssl on the gluster volumes -> disconnects are still occurring<br>

- checked the dropped/error packets on the network interface of the <br>

storage nodes -> no dropped packets, no errors<br>

- constant ping check between all nodes while a disconnect occurs -> <br>

zero packet loss, no high latencies<br>

- temporarily deactivated one of the two interfaces that make up the <br>

bond -> disconnects are still occurring<br>

- updated gluster from 6.x to 9.5 -> disconnects are still occurring<br>

<br>

Important info: I can force this error to happen if I put a high <br>

I/O load on one of the gluster volumes.<br>

<br>

I suspect there could be an issue with a network queue overflow or <br>

something like that, but that theory does not match the result of my <br>

ping check.<br>
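One caveat on the ping result: ping uses ICMP, so it would not necessarily <br>

show drops that happen in TCP socket queues. A possible check is to compare <br>

the kernel's TcpExt counters in /proc/net/netstat before and after forcing <br>

the error with I/O load. This is only a sketch. The choice of counters <br>

(ListenOverflows, ListenDrops, TCPBacklogDrop) is an assumption about what <br>

might be relevant here, and the function names are hypothetical:<br>

```python
# Counters picked for illustration; they may not be the relevant ones here.
WATCHED = (
    "TcpExt.ListenOverflows",
    "TcpExt.ListenDrops",
    "TcpExt.TCPBacklogDrop",
)


def parse_netstat(text):
    """Parse /proc/net/netstat content into a {"Prefix.Name": value} dict.

    The file consists of line pairs: a header line with counter names and a
    value line with the numbers, both starting with the same "Prefix:".
    """
    counters = {}
    lines = text.splitlines()
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, _, names = header.partition(":")
        value_prefix, _, numbers = values.partition(":")
        if prefix != value_prefix:
            continue  # not a matching header/value pair
        for name, number in zip(names.split(), numbers.split()):
            counters[prefix + "." + name] = int(number)
    return counters


def counter_delta(before, after, keys=WATCHED):
    """Return how much each watched counter grew between two snapshots."""
    return {key: after.get(key, 0) - before.get(key, 0) for key in keys}
```

Taking one snapshot with open("/proc/net/netstat").read() before the load <br>

test and one after a disconnect, then passing both through parse_netstat <br>

and counter_delta, would show whether any of these counters moved.<br>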

<br>

<br>

What would be your next step to debug this error?<br>

<br>

<br>

Thanks in advance!<br>


</blockquote></div>
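<div dir="auto">For the renice idea, a minimal sketch could look like the following. It assumes the relevant processes are named glusterd, glusterfsd and glusterfs, that it runs as root, and that -10 is a reasonable value to try. The helper names and the niceness value are illustrative, not a recommendation from the Gluster project.</div>

```python
import os

# Process names assumed to belong to Gluster on the storage nodes.
GLUSTER_NAMES = {"glusterd", "glusterfsd", "glusterfs"}


def is_gluster_process(comm):
    """Return True if the content of /proc/PID/comm names a Gluster process."""
    return comm.strip() in GLUSTER_NAMES


def gluster_pids(proc_root="/proc"):
    """Collect the PIDs of all running Gluster processes."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if is_gluster_process(fh.read()):
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return pids


def renice(pids, niceness=-10):
    """Set the nice value of each PID; negative values need root.

    On Linux the nice value is a per-thread attribute, so this changes the
    main thread only. Other threads would need the same call with their
    thread IDs, which are listed under /proc/PID/task.
    """
    for pid in pids:
        os.setpriority(os.PRIO_PROCESS, pid, niceness)


# Example usage (as root): renice(gluster_pids())
```

<div dir="auto">If the disconnects stop under I/O load after this, that would point toward the gluster processes being starved of CPU time rather than toward the network.</div>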