[Gluster-users] Upgrading from Gluster 3.8 to 3.12

Atin Mukherjee amukherj at redhat.com
Wed Dec 20 05:58:10 UTC 2017


Looks like a bug: I see that tier-enabled=0 is an additional entry in the
info file on shchhv01. As per the code, this field should only be written
into the glusterd store if the op-version is >= 30706. My guess is that,
since 3.8.4 didn't have commit 33f8703a1 ("glusterd: regenerate volfiles on
op-version bump up"), the info and volfiles were not regenerated when the
op-version was bumped, which caused the tier-enabled entry to be missing
from the info file on the other nodes.

For now, you can copy the info file for the volumes where the mismatch
happened from shchhv01 to shchhv02 and restart the glusterd service on
shchhv02. That should fix this temporarily. Unfortunately, this step might
need to be repeated on the other nodes as well.
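The copy-and-restart workaround can be sketched as below. This is only a
hedged sketch: it assumes the standard glusterd store layout
(/var/lib/glusterd/vols/<volname>/info), root SSH access between peers, and
a systemd-managed glusterd; the hostnames and volume name are the ones from
this thread and must be adjusted for your cluster.

```shell
#!/bin/sh
# Sync the volume "info" file from the node whose copy carries the
# tier-enabled entry to a rejected peer, then restart glusterd there
# so the volume checksum is recomputed from the new file.
VOL=shchst01
GOOD=shchhv01   # node with the complete info file
BAD=shchhv02    # peer currently in "Peer Rejected" state

INFO=/var/lib/glusterd/vols/${VOL}/info

# Keep a backup of the peer's current info file before overwriting it.
ssh "root@${BAD}" "cp ${INFO} ${INFO}.bak"
scp "root@${GOOD}:${INFO}" "/tmp/${VOL}.info"
scp "/tmp/${VOL}.info" "root@${BAD}:${INFO}"
ssh "root@${BAD}" "systemctl restart glusterd"
```

After the restart, `gluster peer status` on the fixed node should show the
upgraded peer move out of the rejected state if the checksums now match.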

@Hari - Could you help debug this further?



On Wed, Dec 20, 2017 at 10:44 AM, Gustave Dahl <gustave at dahlfamily.net>
wrote:

> I was attempting the same on a local sandbox and also have the same
> problem.
>
>
> Current: 3.8.4
>
> Volume Name: shchst01
> Type: Distributed-Replicate
> Volume ID: bcd53e52-cde6-4e58-85f9-71d230b7b0d3
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 4 x 3 = 12
> Transport-type: tcp
> Bricks:
> Brick1: shchhv01-sto:/data/brick3/shchst01
> Brick2: shchhv02-sto:/data/brick3/shchst01
> Brick3: shchhv03-sto:/data/brick3/shchst01
> Brick4: shchhv01-sto:/data/brick1/shchst01
> Brick5: shchhv02-sto:/data/brick1/shchst01
> Brick6: shchhv03-sto:/data/brick1/shchst01
> Brick7: shchhv02-sto:/data/brick2/shchst01
> Brick8: shchhv03-sto:/data/brick2/shchst01
> Brick9: shchhv04-sto:/data/brick2/shchst01
> Brick10: shchhv02-sto:/data/brick4/shchst01
> Brick11: shchhv03-sto:/data/brick4/shchst01
> Brick12: shchhv04-sto:/data/brick4/shchst01
> Options Reconfigured:
> cluster.data-self-heal-algorithm: full
> features.shard-block-size: 512MB
> features.shard: enable
> performance.readdir-ahead: on
> storage.owner-uid: 9869
> storage.owner-gid: 9869
> server.allow-insecure: on
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> cluster.self-heal-daemon: on
> nfs.disable: on
> performance.io-thread-count: 64
> performance.cache-size: 1GB
>
> Upgraded shchhv01-sto to 3.12.3, others remain at 3.8.4
>
> RESULT
> =====================
> Hostname: shchhv01-sto
> Uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816
> State: Peer Rejected (Connected)
>
> Upgraded Server:  shchhv01-sto
> ==============================
> [2017-12-20 05:02:44.747313] I [MSGID: 101190]
> [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread
> with
> index 1
> [2017-12-20 05:02:44.747387] I [MSGID: 101190]
> [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread
> with
> index 2
> [2017-12-20 05:02:44.749087] W [rpc-clnt-ping.c:246:rpc_clnt_ping_cbk]
> 0-management: RPC_CLNT_PING notify failed
> [2017-12-20 05:02:44.749165] W [rpc-clnt-ping.c:246:rpc_clnt_ping_cbk]
> 0-management: RPC_CLNT_PING notify failed
> [2017-12-20 05:02:44.749563] W [rpc-clnt-ping.c:246:rpc_clnt_ping_cbk]
> 0-management: RPC_CLNT_PING notify failed
> [2017-12-20 05:02:54.676324] I [MSGID: 106493]
> [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received
> RJT
> from uuid: 546503ae-ba0e-40d4-843f-c5dbac22d272, host: shchhv02-sto,
> port: 0
> [2017-12-20 05:02:54.690237] I [MSGID: 106163]
> [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack]
> 0-management:
> using the op-version 30800
> [2017-12-20 05:02:54.695823] I [MSGID: 106490]
> [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req]
> 0-glusterd:
> Received probe from uuid: 546503ae-ba0e-40d4-843f-c5dbac22d272
> [2017-12-20 05:02:54.696956] E [MSGID: 106010]
> [glusterd-utils.c:3370:glusterd_compare_friend_volume] 0-management:
> Version
> of Cksums shchst01-sto differ. local cksum = 4218452135, remote cksum =
> 2747317484 on peer shchhv02-sto
> [2017-12-20 05:02:54.697796] I [MSGID: 106493]
> [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd:
> Responded to shchhv02-sto (0), ret: 0, op_ret: -1
> [2017-12-20 05:02:55.033822] I [MSGID: 106493]
> [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received
> RJT
> from uuid: 3de22cb5-c1c1-4041-a1e1-eb969afa9b4b, host: shchhv03-sto,
> port: 0
> [2017-12-20 05:02:55.038460] I [MSGID: 106163]
> [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack]
> 0-management:
> using the op-version 30800
> [2017-12-20 05:02:55.040032] I [MSGID: 106490]
> [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req]
> 0-glusterd:
> Received probe from uuid: 3de22cb5-c1c1-4041-a1e1-eb969afa9b4b
> [2017-12-20 05:02:55.040266] E [MSGID: 106010]
> [glusterd-utils.c:3370:glusterd_compare_friend_volume] 0-management:
> Version
> of Cksums shchst01-sto differ. local cksum = 4218452135, remote cksum =
> 2747317484 on peer shchhv03-sto
> [2017-12-20 05:02:55.040405] I [MSGID: 106493]
> [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd:
> Responded to shchhv03-sto (0), ret: 0, op_ret: -1
> [2017-12-20 05:02:55.584854] I [MSGID: 106493]
> [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received
> RJT
> from uuid: 36306e37-d7f0-4fec-9140-0d0f1bd2d2d5, host: shchhv04-sto,
> port: 0
> [2017-12-20 05:02:55.595125] I [MSGID: 106163]
> [glusterd-handshake.c:1316:__glusterd_mgmt_hndsk_versions_ack]
> 0-management:
> using the op-version 30800
> [2017-12-20 05:02:55.600804] I [MSGID: 106490]
> [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req]
> 0-glusterd:
> Received probe from uuid: 36306e37-d7f0-4fec-9140-0d0f1bd2d2d5
> [2017-12-20 05:02:55.601288] E [MSGID: 106010]
> [glusterd-utils.c:3370:glusterd_compare_friend_volume] 0-management:
> Version
> of Cksums shchst01-sto differ. local cksum = 4218452135, remote cksum =
> 2747317484 on peer shchhv04-sto
> [2017-12-20 05:02:55.601497] I [MSGID: 106493]
> [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd:
> Responded to shchhv04-sto (0), ret: 0, op_ret: -1
>
> Another Server:  shchhv02-sto
> ==============================
> [2017-12-20 05:02:44.667833] W
> [glusterd-locks.c:675:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x1de5c)
> [0x7f75fdc12e5c]
> -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x27a08)
> [0x7f75fdc1ca08]
> -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd07fa)
> [0x7f75fdcc57fa] ) 0-management: Lock for vol shchst01-sto not held
> [2017-12-20 05:02:44.667795] I [MSGID: 106004]
> [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer
> <shchhv01-sto> (<f6205edb-a0ea-4247-9594-c4cdc0d05816>), in state <Peer
> Rejected>, has disconnected from glusterd.
> [2017-12-20 05:02:44.667948] W [MSGID: 106118]
> [glusterd-handler.c:5241:__glusterd_peer_rpc_notify] 0-management: Lock
> not
> released for shchst01-sto
> [2017-12-20 05:02:44.760103] I [MSGID: 106163]
> [glusterd-handshake.c:1271:__glusterd_mgmt_hndsk_versions_ack]
> 0-management:
> using the op-version 30800
> [2017-12-20 05:02:44.765389] I [MSGID: 106490]
> [glusterd-handler.c:2608:__glusterd_handle_incoming_friend_req]
> 0-glusterd:
> Received probe from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816
> [2017-12-20 05:02:54.686185] E [MSGID: 106010]
> [glusterd-utils.c:2930:glusterd_compare_friend_volume] 0-management:
> Version
> of Cksums shchst01 differ. local cksum = 2747317484, remote cksum =
> 4218452135 on peer shchhv01-sto
> [2017-12-20 05:02:54.686882] I [MSGID: 106493]
> [glusterd-handler.c:3852:glusterd_xfer_friend_add_resp] 0-glusterd:
> Responded to shchhv01-sto (0), ret: 0, op_ret: -1
> [2017-12-20 05:02:54.717854] I [MSGID: 106493]
> [glusterd-rpc-ops.c:476:__glusterd_friend_add_cbk] 0-glusterd: Received
> RJT
> from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816, host: shchhv01-sto,
> port: 0
>
> Another Server:  shchhv04-sto
> ==============================
> [2017-12-20 05:02:44.667620] I [MSGID: 106004]
> [glusterd-handler.c:5219:__glusterd_peer_rpc_notify] 0-management: Peer
> <shchhv01-sto> (<f6205edb-a0ea-4247-9594-c4cdc0d05816>), in state <Peer
> Rejected>, has disconnected from glusterd.
> [2017-12-20 05:02:44.667808] W
> [glusterd-locks.c:675:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x1de5c)
> [0x7f10a33d9e5c]
> -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x27a08)
> [0x7f10a33e3a08]
> -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd07fa)
> [0x7f10a348c7fa] ) 0-management: Lock for vol shchst01-sto not held
> [2017-12-20 05:02:44.667827] W [MSGID: 106118]
> [glusterd-handler.c:5241:__glusterd_peer_rpc_notify] 0-management: Lock
> not
> released for shchst01-sto
> [2017-12-20 05:02:44.760077] I [MSGID: 106163]
> [glusterd-handshake.c:1271:__glusterd_mgmt_hndsk_versions_ack]
> 0-management:
> using the op-version 30800
> [2017-12-20 05:02:44.768796] I [MSGID: 106490]
> [glusterd-handler.c:2608:__glusterd_handle_incoming_friend_req]
> 0-glusterd:
> Received probe from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816
> [2017-12-20 05:02:55.595095] E [MSGID: 106010]
> [glusterd-utils.c:2930:glusterd_compare_friend_volume] 0-management:
> Version
> of Cksums shchst01-sto differ. local cksum = 2747317484, remote cksum =
> 4218452135 on peer shchhv01-sto
> [2017-12-20 05:02:55.595273] I [MSGID: 106493]
> [glusterd-handler.c:3852:glusterd_xfer_friend_add_resp] 0-glusterd:
> Responded to shchhv01-sto (0), ret: 0, op_ret: -1
> [2017-12-20 05:02:55.612957] I [MSGID: 106493]
> [glusterd-rpc-ops.c:476:__glusterd_friend_add_cbk] 0-glusterd: Received
> RJT
> from uuid: f6205edb-a0ea-4247-9594-c4cdc0d05816, host: shchhv01-sto,
> port: 0
>
> <vol>/info
>
> Upgraded Server: shchhv01-sto
> =========================
> type=2
> count=12
> status=1
> sub_count=3
> stripe_count=1
> replica_count=3
> disperse_count=0
> redundancy_count=0
> version=52
> transport-type=0
> volume-id=bcd53e52-cde6-4e58-85f9-71d230b7b0d3
> username=5a4ae8d8-dbcb-408e-ab73-629255c14ffc
> password=58652573-0955-4d00-893a-9f42d0f16717
> op-version=30700
> client-op-version=30700
> quota-version=0
> tier-enabled=0
> parent_volname=N/A
> restored_from_snap=00000000-0000-0000-0000-000000000000
> snap-max-hard-limit=256
> cluster.data-self-heal-algorithm=full
> features.shard-block-size=512MB
> features.shard=enable
> nfs.disable=on
> cluster.self-heal-daemon=on
> cluster.server-quorum-type=server
> cluster.quorum-type=auto
> network.remote-dio=enable
> cluster.eager-lock=enable
> performance.stat-prefetch=off
> performance.io-cache=off
> performance.read-ahead=off
> performance.quick-read=off
> server.allow-insecure=on
> storage.owner-gid=9869
> storage.owner-uid=9869
> performance.readdir-ahead=on
> performance.io-thread-count=64
> performance.cache-size=1GB
> brick-0=shchhv01-sto:-data-brick3-shchst01
> brick-1=shchhv02-sto:-data-brick3-shchst01
> brick-2=shchhv03-sto:-data-brick3-shchst01
> brick-3=shchhv01-sto:-data-brick1-shchst01
> brick-4=shchhv02-sto:-data-brick1-shchst01
> brick-5=shchhv03-sto:-data-brick1-shchst01
> brick-6=shchhv02-sto:-data-brick2-shchst01
> brick-7=shchhv03-sto:-data-brick2-shchst01
> brick-8=shchhv04-sto:-data-brick2-shchst01
> brick-9=shchhv02-sto:-data-brick4-shchst01
> brick-10=shchhv03-sto:-data-brick4-shchst01
> brick-11=shchhv04-sto:-data-brick4-shchst01
>
> Another Server:  shchhv02-sto
> ==============================
> type=2
> count=12
> status=1
> sub_count=3
> stripe_count=1
> replica_count=3
> disperse_count=0
> redundancy_count=0
> version=52
> transport-type=0
> volume-id=bcd53e52-cde6-4e58-85f9-71d230b7b0d3
> username=5a4ae8d8-dbcb-408e-ab73-629255c14ffc
> password=58652573-0955-4d00-893a-9f42d0f16717
> op-version=30700
> client-op-version=30700
> quota-version=0
> parent_volname=N/A
> restored_from_snap=00000000-0000-0000-0000-000000000000
> snap-max-hard-limit=256
> cluster.data-self-heal-algorithm=full
> features.shard-block-size=512MB
> features.shard=enable
> performance.readdir-ahead=on
> storage.owner-uid=9869
> storage.owner-gid=9869
> server.allow-insecure=on
> performance.quick-read=off
> performance.read-ahead=off
> performance.io-cache=off
> performance.stat-prefetch=off
> cluster.eager-lock=enable
> network.remote-dio=enable
> cluster.quorum-type=auto
> cluster.server-quorum-type=server
> cluster.self-heal-daemon=on
> nfs.disable=on
> performance.io-thread-count=64
> performance.cache-size=1GB
> brick-0=shchhv01-sto:-data-brick3-shchst01
> brick-1=shchhv02-sto:-data-brick3-shchst01
> brick-2=shchhv03-sto:-data-brick3-shchst01
> brick-3=shchhv01-sto:-data-brick1-shchst01
> brick-4=shchhv02-sto:-data-brick1-shchst01
> brick-5=shchhv03-sto:-data-brick1-shchst01
> brick-6=shchhv02-sto:-data-brick2-shchst01
> brick-7=shchhv03-sto:-data-brick2-shchst01
> brick-8=shchhv04-sto:-data-brick2-shchst01
> brick-9=shchhv02-sto:-data-brick4-shchst01
> brick-10=shchhv03-sto:-data-brick4-shchst01
> brick-11=shchhv04-sto:-data-brick4-shchst01
>
> NOTE
>
> [root at shchhv01 shchst01]# gluster volume get shchst01 cluster.op-version
> Warning: Support to get global option value using `volume get <volname>`
> will be deprecated from next release. Consider using `volume get all`
> instead for global options
> Option                                  Value
> ------                                  -----
> cluster.op-version                      30800
>
> [root at shchhv02 shchst01]# gluster volume get shchst01 cluster.op-version
> Option                                  Value
> ------                                  -----
> cluster.op-version                      30800
>
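The op-version check and bump discussed in this thread can be sketched as
below, following the op-version guide linked further down. This is a hedged
sketch: 31200 is assumed here as the baseline op-version for the 3.12
series; confirm the correct value for your cluster with
cluster.max-op-version before setting anything.

```shell
# Show the current cluster-wide op-version (30800 in this thread).
gluster volume get all cluster.op-version

# Show the highest op-version every connected peer supports; only bump
# up to this value, and only after all nodes have been upgraded.
gluster volume get all cluster.max-op-version

# Bump the cluster op-version (assumed 3.12 baseline value).
gluster volume set all cluster.op-version 31200
```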
> -----Original Message-----
> From: gluster-users-bounces at gluster.org
> [mailto:gluster-users-bounces at gluster.org] On Behalf Of Ziemowit Pierzycki
> Sent: Tuesday, December 19, 2017 3:56 PM
> To: gluster-users <gluster-users at gluster.org>
> Subject: Re: [Gluster-users] Upgrading from Gluster 3.8 to 3.12
>
> I have not done the upgrade yet.  Since this is a production cluster I need
> to make sure it stays up, or schedule some downtime if it doesn't.
> Thanks.
>
> On Tue, Dec 19, 2017 at 10:11 AM, Atin Mukherjee <amukherj at redhat.com>
> wrote:
> >
> >
> > On Tue, Dec 19, 2017 at 1:10 AM, Ziemowit Pierzycki
> > <ziemowit at pierzycki.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> I have a cluster of 10 servers all running Fedora 24 along with
> >> Gluster 3.8.  I'm planning on doing rolling upgrades to Fedora 27
> >> with Gluster 3.12.  I saw the documentation and did some testing but
> >> I would like to run my plan through some (more?) educated minds.
> >>
> >> The current setup is:
> >>
> >> Volume Name: vol0
> >> Distributed-Replicate
> >> Number of Bricks: 2 x (2 + 1) = 6
> >> Bricks:
> >> Brick1: glt01:/vol/vol0
> >> Brick2: glt02:/vol/vol0
> >> Brick3: glt05:/vol/vol0 (arbiter)
> >> Brick4: glt03:/vol/vol0
> >> Brick5: glt04:/vol/vol0
> >> Brick6: glt06:/vol/vol0 (arbiter)
> >>
> >> Volume Name: vol1
> >> Distributed-Replicate
> >> Number of Bricks: 2 x (2 + 1) = 6
> >> Bricks:
> >> Brick1: glt07:/vol/vol1
> >> Brick2: glt08:/vol/vol1
> >> Brick3: glt05:/vol/vol1 (arbiter)
> >> Brick4: glt09:/vol/vol1
> >> Brick5: glt10:/vol/vol1
> >> Brick6: glt06:/vol/vol1 (arbiter)
> >>
> >> After performing the upgrade because of differences in checksums, the
> >> upgraded nodes will become:
> >>
> >> State: Peer Rejected (Connected)
> >
> >
> > Have you upgraded all the nodes? If yes, have you bumped up the
> > cluster.op-version after upgrading all of them? Please follow
> > http://docs.gluster.org/en/latest/Upgrade-Guide/op_version/ for more
> > details on how to bump up the cluster.op-version. If you have done
> > all of this and are still seeing a checksum issue, then I'm afraid
> > you have hit a bug. To debug this further I'd need the checksum
> > mismatch error from the glusterd.log file, along with each peer's
> > copy of the volume's info file from
> > /var/lib/glusterd/vols/<volname>/info.
> >
> >>
> >> If I start doing the upgrades one at a time, with nodes glt10 to
> >> glt01 except for the arbiters glt05 and glt06, and then upgrading the
> >> arbiters last, everything should remain online at all times through
> >> the process.  Correct?
> >>
> >> Thanks.
> >> _______________________________________________
> >> Gluster-users mailing list
> >> Gluster-users at gluster.org
> >> http://lists.gluster.org/mailman/listinfo/gluster-users
> >
> >

