[Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)

Strahil Nikolov hunter86_bg at yahoo.com
Tue Oct 27 15:26:26 UTC 2020


If you use the same block device for the arbiter, I would recommend running 'mkfs' again.
For example, an XFS brick would be created via 'mkfs.xfs -f -i size=512 /dev/DEVICE'.

Reusing a brick without recreating the FS is error-prone.

Also, don't forget to create your brick directory once the device is mounted.
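
A minimal sketch of that sequence (assuming the device is /dev/sdX and the mount point is /srv/glusterfs/othervol; substitute your actual device and layout, and use the brick path from your add-brick command):

mkfs.xfs -f -i size=512 /dev/sdX        # recreate the filesystem; this wipes the old brick contents
mount /dev/sdX /srv/glusterfs/othervol  # mount it (and add it to /etc/fstab so it survives reboots)
mkdir -p /srv/glusterfs/othervol/brick  # recreate the brick directory inside the fresh filesystem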

Best Regards,
Strahil Nikolov


On Tuesday, October 27, 2020 at 08:41:11 GMT+2, mabi <mabi at protonmail.ch> wrote:

First, to answer your question about how this happened: I first hit the issue simply by rebooting my arbiter node yesterday morning in order to do some maintenance, which I do on a regular basis and which was never a problem before GlusterFS 7.8.

I have now removed the arbiter brick from all of my volumes (I have 3 volumes and only one of them uses quota), so I was then able to do a "detach" and then a "probe" of my arbiter node.
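
That is, roughly:

gluster peer detach arbiternode.domain.tld
gluster peer probe arbiternode.domain.tld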

So far so good, so I decided to add an arbiter brick back to one of my smallest volumes, which does not have quota, but I get the following error message:

$ gluster volume add-brick othervol replica 3 arbiter 1 arbiternode.domain.tld:/srv/glusterfs/othervol/brick

volume add-brick: failed: Commit failed on arbiternode.domain.tld. Please check log file for details.

Checking the glusterd.log file of the arbiter node shows the following:

[2020-10-27 06:25:36.011955] I [MSGID: 106578] [glusterd-brick-ops.c:1024:glusterd_op_perform_add_bricks] 0-management: replica-count is set 3
[2020-10-27 06:25:36.011988] I [MSGID: 106578] [glusterd-brick-ops.c:1029:glusterd_op_perform_add_bricks] 0-management: arbiter-count is set 1
[2020-10-27 06:25:36.012017] I [MSGID: 106578] [glusterd-brick-ops.c:1033:glusterd_op_perform_add_bricks] 0-management: type is set 0, need to change it
[2020-10-27 06:25:36.093551] E [MSGID: 106053] [glusterd-utils.c:13790:glusterd_handle_replicate_brick_ops] 0-management: Failed to set extended attribute trusted.add-brick : Transport endpoint is not connected [Transport endpoint is not connected]
[2020-10-27 06:25:36.104897] E [MSGID: 101042] [compat.c:605:gf_umount_lazy] 0-management: Lazy unmount of /tmp/mntQQVzyD [Transport endpoint is not connected]
[2020-10-27 06:25:36.104973] E [MSGID: 106073] [glusterd-brick-ops.c:2051:glusterd_op_add_brick] 0-glusterd: Unable to add bricks
[2020-10-27 06:25:36.105001] E [MSGID: 106122] [glusterd-mgmt.c:317:gd_mgmt_v3_commit_fn] 0-management: Add-brick commit failed.
[2020-10-27 06:25:36.105023] E [MSGID: 106122] [glusterd-mgmt-handler.c:594:glusterd_handle_commit_fn] 0-management: commit failed on operation Add brick
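
For completeness, I guess the obvious connectivity checks to run from the arbiter node would be something like this (just a sketch, using the volume name from above):

gluster peer status             # all peers should show "Peer in Cluster (Connected)"
gluster pool list               # quick overview of the peers and their state
gluster volume status othervol  # the volume's existing bricks should be online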

After that I tried restarting the glusterd service on my arbiter node, and now it is again rejected by the other nodes with exactly the same error message as yesterday about the quota checksums differing, as you can see here:

[2020-10-27 06:30:21.729577] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvol-private differ. local cksum = 0, remote  cksum = 66908910 on peer node2.domain.tld
[2020-10-27 06:30:21.731966] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvol-private differ. local cksum = 0, remote  cksum = 66908910 on peer node1.domain.tld

This is really weird, because at this stage I have not yet even tried to add the arbiter brick for the volume which has quota enabled...
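
As far as I understand, glusterd keeps the quota configuration and its checksum for each volume under /var/lib/glusterd/vols/<VOLNAME>/ on every node, so comparing those files between the nodes might show where the mismatch comes from. A sketch, assuming the default working directory and using myvol-private as the volume name:

# run on node1, node2 and the arbiter node, then compare the output
ls -l /var/lib/glusterd/vols/myvol-private/quota.conf /var/lib/glusterd/vols/myvol-private/quota.cksum
md5sum /var/lib/glusterd/vols/myvol-private/quota.conf

If I read those log lines right, the "local cksum = 0" on the arbiter side would mean it has no (or an empty) quota.conf for that volume.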

After detaching the arbiter node, am I supposed to delete something on the arbiter node?

Something is really wrong here and I am stuck in a loop somehow... any help would be greatly appreciated.


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Tuesday, October 27, 2020 1:26 AM, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:

> You need to fix that "reject" issue before trying anything else.
> Have you tried to "detach" the arbiter and then "probe" it again?
>
> I have no idea what you did to reach that state - can you provide the details?
>
> Best Regards,
> Strahil Nikolov
>
> On Monday, October 26, 2020 at 20:38:38 GMT+2, mabi mabi at protonmail.ch wrote:
>
> OK, I see, I won't go down that path of disabling quota.
>
> I could now remove the arbiter brick from the volume which has the quota issue, so it is now a simple 2-node replica with 1 brick per node.
>
> Now I would like to add the brick back but I get the following error:
>
> volume add-brick: failed: Host arbiternode.domain.tld is not in 'Peer in Cluster' state
>
> In fact I checked and the arbiter node is still rejected as you can see here:
>
> State: Peer Rejected (Connected)
>
> In the glusterd.log file on the arbiter node I see the following errors:
>
> [2020-10-26 18:35:05.605124] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume woelkli-private differ. local cksum = 0, remote  cksum = 66908910 on peer node1.domain.tld
> [2020-10-26 18:35:05.617009] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvol-private differ. local cksum = 0, remote  cksum = 66908910 on peer node2.domain.tld
>
> So although I have removed the arbiter brick from my volume, it still complains about the checksum of the quota configuration. I also tried to restart glusterd on my arbiter node but it does not help. The peer is still rejected.
>
> What should I do at this stage?
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>
> On Monday, October 26, 2020 6:06 PM, Strahil Nikolov hunter86_bg at yahoo.com wrote:
>
> > Detaching the arbiter is pointless...
> > Quota is implemented via extended file attributes, and thus disabling and re-enabling quota on a volume with millions of files will take a lot of time and a lot of IOPS. I would leave it as a last resort.
> > Also, the following script was mentioned on the list and might help you:
> > https://github.com/gluster/glusterfs/blob/devel/extras/quota/quota_fsck.py
> > You can take a look in the mailing list for usage and more details.
> > Best Regards,
> > Strahil Nikolov
> > On Monday, October 26, 2020 at 16:40:06 GMT+2, Diego Zuccato diego.zuccato at unibo.it wrote:
> > On 26/10/20 15:09, mabi wrote:
> >
> > > Right, seen like that this sounds reasonable. Do you actually remember the exact command you ran in order to remove the brick? I was thinking this should be it:
> > > gluster volume remove-brick <VOLNAME> <BRICK> force
> > > but should I use "force" or "start"?
> >
> > Memory does not serve me well (there are 28 disks, not 26!), but bash
> > history does :)
> > gluster volume remove-brick BigVol replica 2 str957-biostq:/srv/arbiters/{00..27}/BigVol force
> > gluster peer detach str957-biostq
> > gluster peer probe str957-biostq
> > gluster volume add-brick BigVol replica 3 arbiter 1 str957-biostq:/srv/arbiters/{00..27}/BigVol
> > You obviously have to wait for remove-brick to complete before detaching the arbiter.
> >
> > > > IIRC it took about 3 days, but the arbiters are on a VM (8CPU, 8GB RAM)
> > > > that uses an iSCSI disk. More than 80% continuous load on both CPUs and RAM.
> > > That's quite long, I must say, and I am in the same case as you: my arbiter is a VM.
> >
> > Give all the CPU and RAM you can. Less than 8 GB of RAM is asking for trouble (in my case).
> >
> > Diego Zuccato
> > DIFA - Dip. di Fisica e Astronomia
> > Servizi Informatici
> > Alma Mater Studiorum - Università di Bologna
> > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> > tel.: +39 051 20 95786


