[Gluster-users] Unable to upgrade nodes because of cksums mismatch

Mon Dec 27 12:55:10 UTC 2021

Hey guys,

I have a problem upgrading our nodes from 8.3 to 10.0. I just upgraded the
first node and ran into the "cksums mismatch" problem. On the upgraded v10
node the checksums for all volumes differ from those on the other v8
nodes, so the node starts in a peer rejected state. I can only
resolve this by following the steps suggested here:
https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Administrator%20Guide/Resolving%20Peer%20Rejected/
(stop glusterd, delete everything in /var/lib/glusterd/ except glusterd.info,
start glusterd, probe a v8 peer, restart glusterd again).
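
For reference, the workaround steps above can be written down as a dry-run
sketch. It only builds and prints the commands and does not execute them. It
assumes glusterd is managed by systemd, and "PEER" is a placeholder hostname
for a healthy v8 node; neither detail comes from the linked guide.

```python
import os

GLUSTERD_DIR = "/var/lib/glusterd"
KEEP = {"glusterd.info"}  # this file must survive the wipe


def entries_to_delete(names):
    """Return the entries under GLUSTERD_DIR that would be removed."""
    return sorted(n for n in names if n not in KEEP)


def recovery_plan(names, peer):
    """Build the ordered list of shell commands without executing anything."""
    plan = ["systemctl stop glusterd"]  # assumption: systemd unit name
    plan += [f"rm -rf {GLUSTERD_DIR}/{n}" for n in entries_to_delete(names)]
    plan += [
        "systemctl start glusterd",
        f"gluster peer probe {peer}",
        "systemctl restart glusterd",
    ]
    return plan


if __name__ == "__main__":
    # On a machine without glusterd this prints only the service/probe steps.
    names = os.listdir(GLUSTERD_DIR) if os.path.isdir(GLUSTERD_DIR) else []
    for cmd in recovery_plan(names, "PEER"):  # "PEER" is a placeholder
        print(cmd)
```

Printing the plan first makes it easy to confirm that glusterd.info is not in
the delete list before running anything by hand.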

The cluster then seems healthy again, self-healing starts and everything
looks fine, except that the newly created cksums still differ from those on
the other nodes. That means this healthy state only lasts until I reboot the
node, at which point it all starts over: the node comes up as peer
rejected.
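
To show which volumes are affected, here is a minimal sketch for comparing the
checksums between two nodes. It assumes each volume has a file
/var/lib/glusterd/vols/<volname>/cksum whose content looks like
"info=<number>"; the helper names are made up for illustration, and collecting
the values from each node is left out.

```python
def parse_cksum(text):
    """Extract the checksum from cksum file content (assumed form: 'info=12345')."""
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "info":
            return int(value.strip())
    raise ValueError("no 'info=' line found")


def mismatched_volumes(local, remote):
    """Return sorted volume names whose checksum differs or is missing on one side.

    `local` and `remote` are dicts of {volname: checksum} gathered per node.
    """
    names = set(local) | set(remote)
    return sorted(v for v in names if local.get(v) != remote.get(v))
```

For example, mismatched_volumes({"vol1": 1, "vol2": 2}, {"vol1": 1, "vol2": 3})
reports only "vol2".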

Now I've read about the problem here:
https://github.com/gluster/glusterfs/issues/1332 (even though that says
the problem should only occur when upgrading from versions earlier than v7)
and also here on the mailing list:
https://lists.gluster.org/pipermail/gluster-users/2021-November/039679.html
(I think I have the same problem, but unfortunately no solution is given there).

The solutions seem to require upgrading all nodes, and the problem should be
resolved once op.version is finally upgraded. But I don't think this
approach can be done online, and there's not really a way for me to do it
offline.

Why is this happening now and not when I upgraded from pre-v7 to v7? All my
nodes are 8.3 and op.version is 8000.

One thing I might have done "wrong": when I upgraded to v8 I didn't set
"gluster volume set <volname> fips-mode-rchecksum on" on the volumes. I
think I just overlooked it in the docs. I have this option set only on the 2
volumes I created after upgrading to v8. But even on those 2 the cksums
differ, so I guess it wouldn't help a lot if I set the option on all the
other volumes?
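
A rough sketch for listing which volumes have the option enabled. It assumes
the full option key is "storage.fips-mode-rchecksum" and that
"gluster volume get <volname> <option>" prints a two-column Option/Value
table; both should be checked against the installed version. The helper names
are hypothetical.

```python
import subprocess

OPTION = "storage.fips-mode-rchecksum"  # assumed full option key


def parse_volume_get(output, option=OPTION):
    """Return the value column for `option` from the command output, or None."""
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == option:
            return fields[1]
    return None


def volumes_missing_option(outputs, option=OPTION):
    """Given {volname: command output}, list volumes where the option is not 'on'."""
    return sorted(v for v, out in outputs.items()
                  if parse_volume_get(out, option) != "on")


def query(volname, option=OPTION):
    """Run the gluster CLI; only call this on a node where gluster is installed."""
    return subprocess.run(
        ["gluster", "volume", "get", volname, option],
        capture_output=True, text=True, check=True,
    ).stdout
```

Calling volumes_missing_option({v: query(v) for v in volnames}) on one node
would give the list of volumes still lacking the setting.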

I really don't know what to do now. I kind of understand the problem but
don't know why this is happening on a cluster that is entirely v8. I can't
take all 9 nodes down, upgrade them all to v10 and rely on "it's all good"
after the final upgrade of op.version.

Can someone point me in a safe direction?

Regards

Mika