[Gluster-users] Volume stuck unable to add a brick

Karthik Subrahmanya ksubrahm at redhat.com
Tue Apr 16 15:26:00 UTC 2019


You're welcome!

On Tue 16 Apr, 2019, 7:12 PM Boris Goldowsky, <bgoldowsky at cast.org> wrote:

> That worked!  Thank you SO much!
>
>
>
> Boris
>
>
>
>
>
> *From: *Karthik Subrahmanya <ksubrahm at redhat.com>
> *Date: *Tuesday, April 16, 2019 at 8:20 AM
> *To: *Boris Goldowsky <bgoldowsky at cast.org>
> *Cc: *Atin Mukherjee <atin.mukherjee83 at gmail.com>, Gluster-users <
> gluster-users at gluster.org>
> *Subject: *Re: [Gluster-users] Volume stuck unable to add a brick
>
>
>
> Hi Boris,
>
>
>
> Thank you for providing the logs.
>
> The problem here is caused by the "auth.allow: 127.0.0.1" setting on the
> volume.
>
> When you try to add a new brick, the replication module internally creates
> a temporary mount and uses it to set some metadata on the existing bricks,
> marking pending heal towards the new brick. Because of the auth.allow
> setting, that mount gets permission errors, as seen in the logs below,
> leading to the add-brick failure.
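>
> (You can confirm the value the bricks are enforcing with the standard CLI,
> for example:
>
> sudo gluster volume get dockervols auth.allow
>
> The same value also shows up under "Options Reconfigured" in the gluster v
> info output below in this thread.)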
>
>
>
> From data-gluster-dockervols.log-webserver9 :
>
> [2019-04-15 14:00:34.226838] I [addr.c:55:compare_addr_and_update]
> 0-/data/gluster/dockervols: allowed = "127.0.0.1", received addr =
> "192.168.200.147"
>
> [2019-04-15 14:00:34.226895] E [MSGID: 115004]
> [authenticate.c:224:gf_authenticate] 0-auth: no authentication module is
> interested in accepting remote-client (null)
>
> [2019-04-15 14:00:34.227129] E [MSGID: 115001]
> [server-handshake.c:848:server_setvolume] 0-dockervols-server: Cannot
> authenticate client from
> webserver8.cast.org-55674-2019/04/15-14:00:20:495333-dockervols-client-2-0-0
> 3.12.2 [Permission denied]
>
>
>
> From dockervols-add-brick-mount.log :
>
> [2019-04-15 14:00:20.672033] W [MSGID: 114043]
> [client-handshake.c:1109:client_setvolume_cbk] 0-dockervols-client-2:
> failed to set the volume [Permission denied]
>
> [2019-04-15 14:00:20.672102] W [MSGID: 114007]
> [client-handshake.c:1138:client_setvolume_cbk] 0-dockervols-client-2:
> failed to get 'process-uuid' from reply dict [Invalid argument]
>
> [2019-04-15 14:00:20.672129] E [MSGID: 114044]
> [client-handshake.c:1144:client_setvolume_cbk] 0-dockervols-client-2:
> SETVOLUME on remote-host failed: Authentication failed [Permission denied]
>
> [2019-04-15 14:00:20.672151] I [MSGID: 114049]
> [client-handshake.c:1258:client_setvolume_cbk] 0-dockervols-client-2:
> sending AUTH_FAILED event
>
>
>
> This is a known issue and we are planning to fix it. For the time being
> there is a workaround (see the example command sequence below):
>
> - Before you try adding the brick, set the auth.allow option back to its
> default, i.e. "*". You can also do this by running "gluster v reset
> <volname> auth.allow".
>
> - Add the brick.
>
> - After it succeeds, set the auth.allow option back to the previous value.
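>
> For example, with the volume and brick from this thread, the sequence
> would look something like this (names and paths taken from your earlier
> add-brick attempt; adjust if yours differ):
>
> # reset auth.allow to its default ("*") before adding the brick
> sudo gluster volume reset dockervols auth.allow
>
> # retry the add-brick that was failing
> sudo gluster volume add-brick dockervols replica 4 \
>     webserver8:/data/gluster/dockervols force
>
> # once it succeeds, restore the previous restriction
> sudo gluster volume set dockervols auth.allow 127.0.0.1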
>
>
>
> Regards,
>
> Karthik
>
>
>
> On Tue, Apr 16, 2019 at 5:20 PM Boris Goldowsky <bgoldowsky at cast.org>
> wrote:
>
> OK, log files attached.
>
>
>
> Boris
>
>
>
>
>
> *From: *Karthik Subrahmanya <ksubrahm at redhat.com>
> *Date: *Tuesday, April 16, 2019 at 2:52 AM
> *To: *Atin Mukherjee <atin.mukherjee83 at gmail.com>, Boris Goldowsky <
> bgoldowsky at cast.org>
> *Cc: *Gluster-users <gluster-users at gluster.org>
> *Subject: *Re: [Gluster-users] Volume stuck unable to add a brick
>
>
>
>
>
>
>
> On Mon, Apr 15, 2019 at 9:43 PM Atin Mukherjee <atin.mukherjee83 at gmail.com>
> wrote:
>
> +Karthik Subrahmanya <ksubrahm at redhat.com>
>
>
>
> Didn't we fix this problem recently? "Failed to set extended attribute"
> indicates that the temp mount is failing and we don't have a quorum of
> bricks up.
>
>
>
> We had two fixes which handle two kinds of add-brick scenarios.
>
> [1] Fails add-brick when increasing the replica count if any of the bricks
> is down, to avoid data loss. This can be overridden by using the force
> option.
>
> [2] Allows add-brick to set the extended attributes via the temp mount if
> the volume is already mounted (has clients).
>
>
>
> They are on version 3.12.2, so patch [1] is present there. But since they
> are using the force option it should not be a problem even if they have a
> brick down. The error message they are getting is also different, so I
> guess it is not because of any brick being down.
>
> Patch [2] is not present in 3.12.2, and this is not a conversion from a
> plain distribute to a replicate volume. So the scenario is different here.
>
> It seems like they are hitting some other issue.
>
>
>
> @Boris,
>
> Can you attach the add-brick's temp mount log? The file name should look
> something like "dockervols-add-brick-mount.log". Can you also provide all
> the brick logs of that volume from that time?
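>
> (If it helps in locating them: on a default install these are usually under
> /var/log/glusterfs/ on the node where the command was run, with the brick
> logs under /var/log/glusterfs/bricks/; exact paths may vary with your setup.)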
>
>
>
> [1] https://review.gluster.org/#/c/glusterfs/+/16330/
>
> [2] https://review.gluster.org/#/c/glusterfs/+/21791/
>
>
>
> Regards,
>
> Karthik
>
>
>
> Boris - What gluster version are you using?
>
>
>
>
>
>
>
> On Mon, Apr 15, 2019 at 7:35 PM Boris Goldowsky <bgoldowsky at cast.org>
> wrote:
>
> Atin, thank you for the reply.  Here are all of those pieces of
> information:
>
>
>
> [bgoldowsky at webserver9 ~]$ gluster --version
>
> glusterfs 3.12.2
>
> (same on all nodes)
>
>
>
> [bgoldowsky at webserver9 ~]$ sudo gluster peer status
>
> Number of Peers: 3
>
>
>
> Hostname: webserver11.cast.org
>
> Uuid: c2b147fd-cab4-4859-9922-db5730f8549d
>
> State: Peer in Cluster (Connected)
>
>
>
> Hostname: webserver1.cast.org
>
> Uuid: 4b918f65-2c9d-478e-8648-81d1d6526d4c
>
> State: Peer in Cluster (Connected)
>
> Other names:
>
> 192.168.200.131
>
> webserver1
>
>
>
> Hostname: webserver8.cast.org
>
> Uuid: be2f568b-61c5-4016-9264-083e4e6453a2
>
> State: Peer in Cluster (Connected)
>
> Other names:
>
> webserver8
>
>
>
> [bgoldowsky at webserver1 ~]$ sudo gluster v info
>
> Volume Name: dockervols
>
> Type: Replicate
>
> Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 1 x 3 = 3
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: webserver1:/data/gluster/dockervols
>
> Brick2: webserver11:/data/gluster/dockervols
>
> Brick3: webserver9:/data/gluster/dockervols
>
> Options Reconfigured:
>
> nfs.disable: on
>
> transport.address-family: inet
>
> auth.allow: 127.0.0.1
>
>
>
> Volume Name: testvol
>
> Type: Replicate
>
> Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 1 x 4 = 4
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: webserver1:/data/gluster/testvol
>
> Brick2: webserver9:/data/gluster/testvol
>
> Brick3: webserver11:/data/gluster/testvol
>
> Brick4: webserver8:/data/gluster/testvol
>
> Options Reconfigured:
>
> transport.address-family: inet
>
> nfs.disable: on
>
>
>
> [bgoldowsky at webserver8 ~]$ sudo gluster v info
>
> Volume Name: dockervols
>
> Type: Replicate
>
> Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 1 x 3 = 3
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: webserver1:/data/gluster/dockervols
>
> Brick2: webserver11:/data/gluster/dockervols
>
> Brick3: webserver9:/data/gluster/dockervols
>
> Options Reconfigured:
>
> nfs.disable: on
>
> transport.address-family: inet
>
> auth.allow: 127.0.0.1
>
>
>
> Volume Name: testvol
>
> Type: Replicate
>
> Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 1 x 4 = 4
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: webserver1:/data/gluster/testvol
>
> Brick2: webserver9:/data/gluster/testvol
>
> Brick3: webserver11:/data/gluster/testvol
>
> Brick4: webserver8:/data/gluster/testvol
>
> Options Reconfigured:
>
> nfs.disable: on
>
> transport.address-family: inet
>
>
>
> [bgoldowsky at webserver9 ~]$ sudo gluster v info
>
> Volume Name: dockervols
>
> Type: Replicate
>
> Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 1 x 3 = 3
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: webserver1:/data/gluster/dockervols
>
> Brick2: webserver11:/data/gluster/dockervols
>
> Brick3: webserver9:/data/gluster/dockervols
>
> Options Reconfigured:
>
> nfs.disable: on
>
> transport.address-family: inet
>
> auth.allow: 127.0.0.1
>
>
>
> Volume Name: testvol
>
> Type: Replicate
>
> Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 1 x 4 = 4
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: webserver1:/data/gluster/testvol
>
> Brick2: webserver9:/data/gluster/testvol
>
> Brick3: webserver11:/data/gluster/testvol
>
> Brick4: webserver8:/data/gluster/testvol
>
> Options Reconfigured:
>
> nfs.disable: on
>
> transport.address-family: inet
>
>
>
> [bgoldowsky at webserver11 ~]$ sudo gluster v info
>
> Volume Name: dockervols
>
> Type: Replicate
>
> Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 1 x 3 = 3
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: webserver1:/data/gluster/dockervols
>
> Brick2: webserver11:/data/gluster/dockervols
>
> Brick3: webserver9:/data/gluster/dockervols
>
> Options Reconfigured:
>
> auth.allow: 127.0.0.1
>
> transport.address-family: inet
>
> nfs.disable: on
>
>
>
> Volume Name: testvol
>
> Type: Replicate
>
> Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 1 x 4 = 4
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: webserver1:/data/gluster/testvol
>
> Brick2: webserver9:/data/gluster/testvol
>
> Brick3: webserver11:/data/gluster/testvol
>
> Brick4: webserver8:/data/gluster/testvol
>
> Options Reconfigured:
>
> transport.address-family: inet
>
> nfs.disable: on
>
>
>
> [bgoldowsky at webserver9 ~]$ sudo gluster volume add-brick dockervols
> replica 4 webserver8:/data/gluster/dockervols force
>
> volume add-brick: failed: Commit failed on webserver8.cast.org. Please
> check log file for details.
>
>
>
> Webserver8 glusterd.log:
>
>
>
> [2019-04-15 13:55:42.338197] I [MSGID: 106488]
> [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management:
> Received get vol req
>
> The message "I [MSGID: 106488]
> [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management:
> Received get vol req" repeated 2 times between [2019-04-15 13:55:42.338197]
> and [2019-04-15 13:55:42.341618]
>
> [2019-04-15 14:00:20.445011] I [run.c:190:runner_log]
> (-->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x3a215)
> [0x7fe697764215]
> -->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0xe3e9d)
> [0x7fe69780de9d] -->/lib64/libglusterfs.so.0(runner_log+0x115)
> [0x7fe6a2d16ea5] ) 0-management: Ran script:
> /var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh
> --volname=dockervols --version=1 --volume-op=add-brick
> --gd-workdir=/var/lib/glusterd
>
> [2019-04-15 14:00:20.445148] I [MSGID: 106578]
> [glusterd-brick-ops.c:1354:glusterd_op_perform_add_bricks] 0-management:
> replica-count is set 4
>
> [2019-04-15 14:00:20.445184] I [MSGID: 106578]
> [glusterd-brick-ops.c:1364:glusterd_op_perform_add_bricks] 0-management:
> type is set 0, need to change it
>
> [2019-04-15 14:00:20.672347] E [MSGID: 106054]
> [glusterd-utils.c:13863:glusterd_handle_replicate_brick_ops] 0-management:
> Failed to set extended attribute trusted.add-brick : Transport endpoint is
> not connected [Transport endpoint is not connected]
>
> [2019-04-15 14:00:20.693491] E [MSGID: 101042]
> [compat.c:569:gf_umount_lazy] 0-management: Lazy unmount of /tmp/mntmvdFGq
> [Transport endpoint is not connected]
>
> [2019-04-15 14:00:20.693597] E [MSGID: 106074]
> [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to add
> bricks
>
> [2019-04-15 14:00:20.693637] E [MSGID: 106123]
> [glusterd-mgmt.c:312:gd_mgmt_v3_commit_fn] 0-management: Add-brick commit
> failed.
>
> [2019-04-15 14:00:20.693667] E [MSGID: 106123]
> [glusterd-mgmt-handler.c:616:glusterd_handle_commit_fn] 0-management:
> commit failed on operation Add brick
>
>
>
> Webserver11 log file:
>
>
>
> [2019-04-15 13:56:29.563270] I [MSGID: 106488]
> [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management:
> Received get vol req
>
> The message "I [MSGID: 106488]
> [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management:
> Received get vol req" repeated 2 times between [2019-04-15 13:56:29.563270]
> and [2019-04-15 13:56:29.566209]
>
> [2019-04-15 14:00:33.996866] I [run.c:190:runner_log]
> (-->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x3a215)
> [0x7f36de924215]
> -->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0xe3e9d)
> [0x7f36de9cde9d] -->/lib64/libglusterfs.so.0(runner_log+0x115)
> [0x7f36e9ed6ea5] ) 0-management: Ran script:
> /var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh
> --volname=dockervols --version=1 --volume-op=add-brick
> --gd-workdir=/var/lib/glusterd
>
> [2019-04-15 14:00:33.996979] I [MSGID: 106578]
> [glusterd-brick-ops.c:1354:glusterd_op_perform_add_bricks] 0-management:
> replica-count is set 4
>
> [2019-04-15 14:00:33.997004] I [MSGID: 106578]
> [glusterd-brick-ops.c:1364:glusterd_op_perform_add_bricks] 0-management:
> type is set 0, need to change it
>
> [2019-04-15 14:00:34.013789] I [MSGID: 106132]
> [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: nfs already
> stopped
>
> [2019-04-15 14:00:34.013849] I [MSGID: 106568]
> [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: nfs service is
> stopped
>
> [2019-04-15 14:00:34.017535] I [MSGID: 106568]
> [glusterd-proc-mgmt.c:88:glusterd_proc_stop] 0-management: Stopping
> glustershd daemon running in pid: 6087
>
> [2019-04-15 14:00:35.018783] I [MSGID: 106568]
> [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: glustershd
> service is stopped
>
> [2019-04-15 14:00:35.018952] I [MSGID: 106567]
> [glusterd-svc-mgmt.c:211:glusterd_svc_start] 0-management: Starting
> glustershd service
>
> [2019-04-15 14:00:35.028306] I [MSGID: 106132]
> [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: bitd already
> stopped
>
> [2019-04-15 14:00:35.028408] I [MSGID: 106568]
> [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: bitd service is
> stopped
>
> [2019-04-15 14:00:35.028601] I [MSGID: 106132]
> [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: scrub already
> stopped
>
> [2019-04-15 14:00:35.028645] I [MSGID: 106568]
> [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: scrub service is
> stopped
>
>
>
> Thank you for taking a look!
>
>
>
> Boris
>
>
>
>
>
> *From: *Atin Mukherjee <atin.mukherjee83 at gmail.com>
> *Date: *Friday, April 12, 2019 at 1:10 PM
> *To: *Boris Goldowsky <bgoldowsky at cast.org>
> *Cc: *Gluster-users <gluster-users at gluster.org>
> *Subject: *Re: [Gluster-users] Volume stuck unable to add a brick
>
>
>
>
>
>
>
> On Fri, 12 Apr 2019 at 22:32, Boris Goldowsky <bgoldowsky at cast.org> wrote:
>
> I’ve got a replicated volume with three bricks (“1x3=3”); the idea is to
> have a common set of files that are locally available on all the machines
> (Scientific Linux 7, which is essentially CentOS 7) in a cluster.
>
>
>
> I tried to add on a fourth machine, so I used a command like this:
>
>
>
> sudo gluster volume add-brick dockervols replica 4
> webserver8:/data/gluster/dockervols force
>
>
>
> but the result is:
>
> volume add-brick: failed: Commit failed on webserver1. Please check log
> file for details.
>
> Commit failed on webserver8. Please check log file for details.
>
> Commit failed on webserver11. Please check log file for details.
>
>
>
> Tried: removing the new brick (this also fails) and trying again.
>
> Tried: checking the logs. The log files are not enlightening to me – I
> don’t know what’s normal and what’s not.
>
>
>
> From webserver8 & webserver11, could you attach the glusterd log files?
>
>
>
> Also please share the following:
>
> - gluster version? (gluster --version)
>
> - Output of 'gluster peer status'
>
> - Output of 'gluster v info' from all 4 nodes.
>
>
>
> Tried: deleting the brick directory from previous attempt, so that it’s
> not in the way.
>
> Tried: restarting gluster services
>
> Tried: rebooting
>
> Tried: setting up a new volume, replicated to all four machines. This
> works, so I’m assuming it’s not a networking issue. But it still fails
> with this existing volume, which has the critical data in it.
>
>
>
> Running out of ideas. Any suggestions?  Thank you!
>
>
>
> Boris
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
> --
>
> --Atin
>
>