<div dir="ltr">Please run 'gluster v get all cluster.max-op-version' and use whatever value it reports to bump up cluster.op-version (gluster v set all cluster.op-version <value>). With that done, if you restart the rejected peer I believe the problem should go away. If it doesn't, I'd need to investigate further once you can pass along the glusterd and cmd_history log files and the contents of /var/lib/glusterd from all the nodes.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Mar 7, 2018 at 4:13 AM, Jamie Lawrence <span dir="ltr"><<a href="mailto:jlawrence@squaretrade.com" target="_blank">jlawrence@squaretrade.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>
> On Mar 5, 2018, at 6:41 PM, Atin Mukherjee <<a href="mailto:amukherj@redhat.com">amukherj@redhat.com</a>> wrote:<br>
<br>
</span><span class="">> I'm tempted to repeat - down things, copy the checksum the "good" ones agree on, start things; but given that this has turned into a balloon-squeezing exercise, I want to make sure I'm not doing this the wrong way.<br>
><br>
> Yes, that's the way. Copy /var/lib/glusterd/vols/<<wbr>volname>/ from the good node to the rejected one and restart glusterd service on the rejected peer.<br>
<br>
<br>
</span>My apologies for the multiple messages - I'm having to work on this episodically.<br>
<br>
I've tried again to reset state on the bad peer, to no avail. This time I downed all of the peers, copied things over, ensuring that the tier-enabled line was absent, and started back up; the cksum immediately changed to a bad value, the two good nodes added that line in, and the bad node didn't have it.<br>
<br>
Just to have a clear view of this, I did it yet again, this time ensuring the tier-enabled line was present everywhere. Same result, except that it didn't add the tier-enabled line, which I suppose makes some sense.<br>
<br>
One oddity - I see:<br>
<br>
# gluster v get all cluster.op-version<br>
Option Value<br>
------ -----<br>
cluster.op-version 30800<br>
<br>
but from one of the `info` files:<br>
<br>
op-version=30712<br>
client-op-version=30712<br>
<br>
I don't know what it means that the cluster is at one version but apparently the volume is set for another - I thought that was a cluster-level setting. (Client.op-version theoretically makes more sense - I can see Ovirt wanting an older version.)<br>
<br>
I'm at a loss to fix this - copying /var/lib/glusterd/vols/<vol> over doesn't fix the problem. I'd be somewhat OK with trashing the volume and starting over, if it weren't for two things: (1) Ovirt was also a massive pain to set up, and it is configured on this volume; and (2) perhaps more importantly, I'm concerned about this happening again once this is in production, which would be Bad, especially if I don't have a fix.<br>
<br>
So at this point, I'm unclear on how to move forward or even where else to look for potential problems.<br>
<span class=""><br>
-j<br>
<br>
- - - -<br>
<br>
</span>[2018-03-06 22:30:32.421530] I [MSGID: 106490] [glusterd-handler.c:2540:__<wbr>glusterd_handle_incoming_<wbr>friend_req] 0-glusterd: Received probe from uuid: 77cdfbba-348c-43fe-ab3d-<wbr>00621904ea9c<br>
[2018-03-06 22:30:32.422582] E [MSGID: 106010] [glusterd-utils.c:3374:<wbr>glusterd_compare_friend_<wbr>volume] 0-management: Version of Cksums sc5-ovirt_engine differ. local cksum = 3949237931, remote cksum = 2068896937 on peer <a href="http://sc5-gluster-10g-1.squaretrade.com" rel="noreferrer" target="_blank">sc5-gluster-10g-1.squaretrade.<wbr>com</a><br>
[2018-03-06 22:30:32.422774] I [MSGID: 106493] [glusterd-handler.c:3800:<wbr>glusterd_xfer_friend_add_resp] 0-glusterd: Responded to <a href="http://sc5-gluster-10g-1.squaretrade.com" rel="noreferrer" target="_blank">sc5-gluster-10g-1.squaretrade.<wbr>com</a> (0), ret: 0, op_ret: -1<br>
[2018-03-06 22:30:32.424621] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__<wbr>glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 77cdfbba-348c-43fe-ab3d-<wbr>00621904ea9c, host: <a href="http://sc5-gluster-10g-1.squaretrade.com" rel="noreferrer" target="_blank">sc5-gluster-10g-1.squaretrade.<wbr>com</a>, port: 0<br>
[2018-03-06 22:30:32.425563] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__<wbr>glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: c1877e0d-ccb2-401e-83a6-<wbr>e4a680af683a, host: <a href="http://sc5-gluster-2.squaretrade.com" rel="noreferrer" target="_blank">sc5-gluster-2.squaretrade.com</a>, port: 0<br>
[2018-03-06 22:30:32.426706] I [MSGID: 106163] [glusterd-handshake.c:1316:__<wbr>glusterd_mgmt_hndsk_versions_<wbr>ack] 0-management: using the op-version 30800<br>
[2018-03-06 22:30:32.428075] I [MSGID: 106490] [glusterd-handler.c:2540:__<wbr>glusterd_handle_incoming_<wbr>friend_req] 0-glusterd: Received probe from uuid: c1877e0d-ccb2-401e-83a6-<wbr>e4a680af683a<br>
[2018-03-06 22:30:32.428325] E [MSGID: 106010] [glusterd-utils.c:3374:<wbr>glusterd_compare_friend_<wbr>volume] 0-management: Version of Cksums sc5-ovirt_engine differ. local cksum = 3949237931, remote cksum = 2068896937 on peer <a href="http://sc5-gluster-2.squaretrade.com" rel="noreferrer" target="_blank">sc5-gluster-2.squaretrade.com</a><br>
[2018-03-06 22:30:32.428468] I [MSGID: 106493] [glusterd-handler.c:3800:<wbr>glusterd_xfer_friend_add_resp] 0-glusterd: Responded to <a href="http://sc5-gluster-2.squaretrade.com" rel="noreferrer" target="_blank">sc5-gluster-2.squaretrade.com</a> (0), ret: 0, op_ret: -1<br>
<br>
</blockquote></div><br></div>
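<div dir="ltr">For reference, the op-version bump suggested at the top of this message can be sketched as a small shell helper. This is only a sketch under assumptions: the function names op_value and bump_op_version are made up for illustration, it assumes 'gluster v get all cluster.max-op-version' prints the same two-column Option/Value table that the quoted message shows for cluster.op-version, and it assumes it is run as root on a node in the pool.<br></div>

```shell
# Hypothetical helper: print the Value column for one option, reading
# `gluster v get ...` style output (Option/Value table) from stdin.
op_value() {
    awk -v opt="$1" '$1 == opt { print $2 }'
}

# Hypothetical helper: read cluster.max-op-version and set
# cluster.op-version to that value. Assumes the gluster CLI is on PATH.
bump_op_version() {
    max_op=$(gluster v get all cluster.max-op-version | op_value cluster.max-op-version)
    # Stop if nothing could be parsed rather than setting an empty value.
    [ -n "$max_op" ] || return 1
    gluster v set all cluster.op-version "$max_op"
}
```

<div dir="ltr">After sourcing this, bump_op_version would be run once, since the option is set with 'all'; the rejected peer would still need to be restarted afterwards, as described at the top.<br></div>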