[Gluster-users] Gluster replicate-brick issues (Distributed-Replica)

Sat Feb 14 08:58:12 UTC 2015

We tried to migrate a brick from one server to another using the commands below, but the data is NOT being replicated and the brick no longer shows up.
Gluster still appears to be working, but the bricks are not balanced. I still need to add the other brick for Server3, and I don't want to do that until Server1:Brick2 has been replicated.

This is the command that created the original volume:
[root at Server1 ~]# gluster volume create Storage1 replica 2 transport tcp Server1:/exp/br01/brick1 Server2:/exp/br01/brick1 Server1:/exp/br02/brick2 Server2:/exp/br02/brick2

This is the current configuration BEFORE the migration. Server3 has been peer probed successfully, but nothing else has been done with it yet.
[root at Server1 ~]# gluster --version
glusterfs 3.6.2 built on Jan 22 2015 12:58:11

[root at Server1 ~]# gluster volume status
Status of volume: Storage1
Gluster process                 Port    Online  Pid
------------------------------------------------------------------------------
Brick Server1:/exp/br01/brick1  49152   Y       2167
Brick Server2:/exp/br01/brick1  49152   Y       2192
Brick Server1:/exp/br02/brick2  49153   Y       2172   <--- this is the one that goes missing
Brick Server2:/exp/br02/brick2  49153   Y       2193
NFS Server on localhost         2049    Y       2181
Self-heal Daemon on localhost   N/A     Y       2186
NFS Server on Server2           2049    Y       2205
Self-heal Daemon on Server2     N/A     Y       2210
NFS Server on Server3           2049    Y       6015
Self-heal Daemon on Server3     N/A     Y       6016

Task Status of Volume Storage1
------------------------------------------------------------------------------
There are no active volume tasks
[root at Server1 ~]# gluster volume info

Volume Name: Storage1
Type: Distributed-Replicate
Volume ID: 9616ce42-48bd-4fe3-883f-decd6c4fcd00
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: Server1:/exp/br01/brick1
Brick2: Server2:/exp/br01/brick1
Brick3: Server1:/exp/br02/brick2
Brick4: Server2:/exp/br02/brick2
Options Reconfigured:
diagnostics.brick-log-level: WARNING
diagnostics.client-log-level: WARNING
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
performance.cache-size: 1024MB
performance.cache-max-file-size: 2MB
performance.cache-refresh-timeout: 1
performance.stat-prefetch: off
performance.read-ahead: on
performance.quick-read: off
performance.write-behind-window-size: 4MB
performance.flush-behind: on
performance.write-behind: on
performance.io-thread-count: 32
performance.io-cache: on
network.ping-timeout: 2
nfs.addr-namelookup: off
performance.strict-write-ordering: on
[root at Server1 ~]#

So we start the migration of the brick to the new server using the replace-brick command:
[root at Server1 ~]# volname=Storage1

[root at Server1 ~]# from=Server1:/exp/br02/brick2

[root at Server1 ~]# to=Server3:/exp/br02/brick2

[root at Server1 ~]# gluster volume replace-brick $volname $from $to start
All replace-brick commands except commit force are deprecated. Do you want to continue? (y/n) y
volume replace-brick: success: replace-brick started successfully
ID: 0062d555-e7eb-4ebe-a264-7e0baf6e7546

[root at Server1 ~]# gluster volume replace-brick $volname $from $to status
All replace-brick commands except commit force are deprecated. Do you want to continue? (y/n) y
volume replace-brick: success: Number of files migrated = 281   Migration complete

At this point everything seems to be in order with no outstanding issues.

[root at Server1 ~]# gluster volume status
Status of volume: Storage1
Gluster process                 Port    Online  Pid
------------------------------------------------------------------------------
Brick Server1:/exp/br01/brick1  49152   Y       2167
Brick Server2:/exp/br01/brick1  49152   Y       2192
Brick Server1:/exp/br02/brick2  49153   Y       27557
Brick Server2:/exp/br02/brick2  49153   Y       2193
NFS Server on localhost         2049    Y       27562
Self-heal Daemon on localhost   N/A     Y       2186
NFS Server on Server2           2049    Y       2205
Self-heal Daemon on Server2     N/A     Y       2210
NFS Server on Server3           2049    Y       6015
Self-heal Daemon on Server3     N/A     Y       6016

Task Status of Volume Storage1
------------------------------------------------------------------------------
Task                 : Replace brick
ID                   : 0062d555-e7eb-4ebe-a264-7e0baf6e7546
Source Brick         : Server1:/exp/br02/brick2
Destination Brick    : Server3:/exp/br02/brick2
Status               : completed
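
Before committing, the reported count of 281 could be cross-checked against the bricks themselves. This is only a minimal sketch, assuming it is run on each server against its own brick path; count_brick_files is a hypothetical helper (not a gluster command), and it skips the internal .glusterfs directory so that only user-visible files are counted:

```shell
#!/bin/sh
# Hypothetical helper: count regular files under a brick directory.
# Any directory named .glusterfs is pruned so that GlusterFS's internal
# metadata tree does not inflate the count.
count_brick_files() {
    brick_dir=$1
    find "$brick_dir" -name .glusterfs -prune -o -type f -print \
        | wc -l | tr -d ' '
}

# Example (run on the server that hosts the brick):
#   count_brick_files /exp/br02/brick2
```

Running it against /exp/br02/brick2 on Server1 and again on Server3, then comparing the two numbers, would show whether the destination really holds what the status output claims.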

The volume reports that the replace-brick task completed, so the next step is to commit the change:

[root at Server1 ~]# gluster volume replace-brick $volname $from $to commit
All replace-brick commands except commit force are deprecated. Do you want to continue? (y/n) y
volume replace-brick: success: replace-brick commit successful

At this point, when I look at the status, the OLD brick (Server1:/exp/br02/brick2) is missing AND the new brick does not show up either... panic!

[root at Server1 ~]# gluster volume status
Status of volume: Storage1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick Server1:/exp/br01/brick1  49152   Y       2167
Brick Server2:/exp/br01/brick1  49152   Y       2192
Brick Server2:/exp/br02/brick2  49153   Y       2193
NFS Server on localhost         2049    Y       28906
Self-heal Daemon on localhost   N/A     Y       28911
NFS Server on Server2           2049    Y       2205
Self-heal Daemon on Server2     N/A     Y       2210
NFS Server on Server3           2049    Y       6015
Self-heal Daemon on Server3     N/A     Y       6016

Task Status of Volume Storage1
------------------------------------------------------------------------------
There are no active volume tasks

After the commit, Server1 no longer lists the task, yet Server2 and Server3 still see this:

[root at Server2 ~]# gluster volume status
Status of volume: Storage1
Gluster process                 Port    Online  Pid
------------------------------------------------------------------------------
Brick Server1:/exp/br01/brick1  49152   Y       2167
Brick Server2:/exp/br01/brick1  49152   Y       2192
Brick Server2:/exp/br02/brick2  49153   Y       2193
NFS Server on localhost         2049    Y       2205
Self-heal Daemon on localhost   N/A     Y       2210
NFS Server on 10.45.16.17       2049    Y       28906
Self-heal Daemon on 10.45.16.17 N/A     Y       28911
NFS Server on server3           2049    Y       6015
Self-heal Daemon on server3     N/A     Y       6016

Task Status of Volume Storage1
------------------------------------------------------------------------------
Task                 : Replace brick
ID                   : 0062d555-e7eb-4ebe-a264-7e0baf6e7546
Source Brick         : Server1:/exp/br02/brick2
Destination Brick    : server3:/exp/br02/brick2
Status               : completed

If I look at the brick on Server3, it is NOT empty, but it is missing A LOT. It's as if the replace-brick stopped and never restarted.
The replace-brick status reported "Number of files migrated = 281   Migration complete", but when I count the files on the Server3 brick I get:
       [root at Server3 brick2]# find . -type f -print | wc -l
16

I'm missing 265 files (they still exist on the OLD brick, but how can I move them?)
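
To see exactly which files are missing rather than just how many, the file lists of the two bricks can be compared. The following is a rough sketch only: list_brick_files and missing_on_new are hypothetical helper names, old.txt and new.txt are placeholder file names, and it assumes each list is generated on the server that hosts the brick and then copied to one place.

```shell
#!/bin/sh
# Hypothetical helper: print brick-relative paths of regular files,
# sorted in the C locale so the output is suitable for comm(1).
# Directories named .glusterfs are skipped.
list_brick_files() {
    (cd "$1" && find . -name .glusterfs -prune -o -type f -print | LC_ALL=C sort)
}

# Hypothetical helper: print paths present in the old list but absent
# from the new one. Both arguments are files produced by list_brick_files.
missing_on_new() {
    LC_ALL=C comm -23 "$1" "$2"
}

# Example:
#   on Server1:  list_brick_files /exp/br02/brick2 > old.txt
#   on Server3:  list_brick_files /exp/br02/brick2 > new.txt
#   then, with both files on one host:
#   missing_on_new old.txt new.txt
```

This only reads the bricks, so it should be safe to run while deciding how to proceed.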

If I try to add the old brick back, paired with another brick on the new server, like this:
[root at Server1 ~]# gluster volume add-brick Storage1 Server1:/exp/br02/brick2 Server3:/exp/br01/brick1
volume add-brick: failed: /exp/br02/brick2 is already part of a volume

I'm fearful of running:
[root at Server1 ~]# setfattr -n trusted.glusterfs.volume-id -v 0x$(grep volume-id /var/lib/glusterd/vols/$volname/info | cut -d= -f2 | sed 's/-//g') /exp/br02/brick2
although it should allow me to add the brick.
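
For anyone unsure what that one-liner would actually write, the value can be computed and printed first without touching the brick. A small sketch, assuming (as the command above does) that the volume's info file contains a volume-id=<uuid> line; volume_id_hex is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the hex value that the setfattr command
# above would write, derived from the volume-id line of a glusterd
# info file. Nothing is modified.
volume_id_hex() {
    info_file=$1
    id=$(grep '^volume-id=' "$info_file" | cut -d= -f2 | sed 's/-//g')
    printf '0x%s\n' "$id"
}

# Example (dry run, prints the value only):
#   volume_id_hex /var/lib/glusterd/vols/Storage1/info
#
# The attribute currently set on the brick can be read for comparison:
#   getfattr -n trusted.glusterfs.volume-id -e hex /exp/br02/brick2
```

Comparing the printed value with what getfattr reports shows whether the brick already carries this volume's ID before anything is changed.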

Gluster Heal info returns:
[root at Server2 ~]# gluster volume heal Storage1 info
Brick Server1:/exp/br01/brick1/
Number of entries: 0

Brick Server2:/exp/br01/brick1/
Number of entries: 0

Brick Server1:/exp/br02/brick2
Status: Transport endpoint is not connected

Brick Server2:/exp/br02/brick2/
Number of entries: 0

I have restarted glusterd numerous times.

At this point I'm not sure where to go from here. I know that Server1:/exp/br02/brick2 still has all the data, and Server3:/exp/br02/brick2 is not complete.

How do I actually get the brick to replicate?
How can I add Server1:/exp/br02/brick2 back into the trusted pool if I can't replicate it, or re-add it?
How can I fix this to get it back into a replicated state between the three servers?

Thomas

----DATA----

Gluster volume info at this point
[root at Server1 ~]# gluster volume info

Volume Name: Storage1
Type: Distributed-Replicate
Volume ID: 9616ce42-48bd-4fe3-883f-decd6c4fcd00
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: Server1:/exp/br01/brick1
Brick2: Server2:/exp/br01/brick1
Brick3: server3:/exp/br02/brick2
Brick4: Server2:/exp/br02/brick2
Options Reconfigured:
diagnostics.brick-log-level: WARNING
diagnostics.client-log-level: WARNING
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
performance.cache-size: 1024MB
performance.cache-max-file-size: 2MB
performance.cache-refresh-timeout: 1
performance.stat-prefetch: off
performance.read-ahead: on
performance.quick-read: off
performance.write-behind-window-size: 4MB
performance.flush-behind: on
performance.write-behind: on
performance.io-thread-count: 32
performance.io-cache: on
network.ping-timeout: 2
nfs.addr-namelookup: off
performance.strict-write-ordering: on
[root at Server1 ~]#
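
Since the two "gluster volume info" listings are long and differ only in the brick list, it can help to pull out just the bricks and diff them. A minimal sketch, assuming the output was saved to text files; list_bricks is a hypothetical helper and before.txt / after.txt are placeholder names:

```shell
#!/bin/sh
# Hypothetical helper: print the host:path of every "BrickN:" line in a
# saved copy of `gluster volume info` output. The bare "Bricks:" header
# is not matched because it has no number.
list_bricks() {
    awk '/^Brick[0-9]+:/ { print $2 }' "$1"
}

# Example:
#   gluster volume info Storage1 > after.txt
#   list_bricks before.txt > before.bricks
#   list_bricks after.txt  > after.bricks
#   diff before.bricks after.bricks
```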

[root at server3 brick2]# gluster volume heal Storage1 info
Brick Server1:/exp/br01/brick1/
Number of entries: 0

Brick Server2:/exp/br01/brick1/
Number of entries: 0

Brick Server3:/exp/br02/brick2/
Number of entries: 0

Brick Server2:/exp/br02/brick2/
Number of entries: 0

Gluster log (there are a few errors, but I'm not sure how to decipher them):

[2015-02-14 06:29:19.862809] I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick Server1:/exp/br02/brick2 has disconnected from glusterd.
[2015-02-14 06:29:19.862836] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/7565ec897c6454bd3e2f4800250a7221.socket failed (Invalid argument)
[2015-02-14 06:29:19.862853] I [MSGID: 106006] [glusterd-handler.c:4257:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd.
[2015-02-14 06:29:19.953762] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /exp/br02/brick2 on port 49153
[2015-02-14 06:31:12.977450] I [glusterd-replace-brick.c:99:__glusterd_handle_replace_brick] 0-management: Received replace brick req
[2015-02-14 06:31:12.977495] I [glusterd-replace-brick.c:154:__glusterd_handle_replace_brick] 0-management: Received replace brick status request
[2015-02-14 06:31:13.048852] I [glusterd-replace-brick.c:1412:rb_update_srcbrick_port] 0-: adding src-brick port no
[2015-02-14 06:31:19.588380] I [glusterd-replace-brick.c:99:__glusterd_handle_replace_brick] 0-management: Received replace brick req
[2015-02-14 06:31:19.588422] I [glusterd-replace-brick.c:154:__glusterd_handle_replace_brick] 0-management: Received replace brick status request
[2015-02-14 06:31:19.661101] I [glusterd-replace-brick.c:1412:rb_update_srcbrick_port] 0-: adding src-brick port no
[2015-02-14 06:31:45.115355] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
[2015-02-14 06:31:45.118597] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1
[2015-02-14 06:32:10.956357] I [glusterd-replace-brick.c:99:__glusterd_handle_replace_brick] 0-management: Received replace brick req
[2015-02-14 06:32:10.956385] I [glusterd-replace-brick.c:154:__glusterd_handle_replace_brick] 0-management: Received replace brick commit request
[2015-02-14 06:32:11.028472] I [glusterd-replace-brick.c:1412:rb_update_srcbrick_port] 0-: adding src-brick port no
[2015-02-14 06:32:12.122552] I [glusterd-utils.c:6276:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV3 successfully
[2015-02-14 06:32:12.131836] I [glusterd-utils.c:6281:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV1 successfully
[2015-02-14 06:32:12.141107] I [glusterd-utils.c:6286:glusterd_nfs_pmap_deregister] 0-: De-registered NFSV3 successfully
[2015-02-14 06:32:12.150375] I [glusterd-utils.c:6291:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v4 successfully
[2015-02-14 06:32:12.159630] I [glusterd-utils.c:6296:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v1 successfully
[2015-02-14 06:32:12.168889] I [glusterd-utils.c:6301:glusterd_nfs_pmap_deregister] 0-: De-registered ACL v3 successfully
[2015-02-14 06:32:13.254689] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2015-02-14 06:32:13.254799] W [socket.c:2992:socket_connect] 0-management: Ignore failed connection attempt on , (No such file or directory)
[2015-02-14 06:32:13.257790] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2015-02-14 06:32:13.257908] W [socket.c:2992:socket_connect] 0-management: Ignore failed connection attempt on , (No such file or directory)
[2015-02-14 06:32:13.258031] W [socket.c:611:__socket_rwv] 0-socket.management: writev on 127.0.0.1:1019 failed (Broken pipe)
[2015-02-14 06:32:13.258111] W [socket.c:611:__socket_rwv] 0-socket.management: writev on 127.0.0.1:1021 failed (Broken pipe)
[2015-02-14 06:32:13.258130] W [socket.c:611:__socket_rwv] 0-socket.management: writev on 10.45.16.17:1018 failed (Broken pipe)
[2015-02-14 06:32:13.711948] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=588 max=0 total=0
[2015-02-14 06:32:13.711967] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2015-02-14 06:32:13.712008] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=588 max=0 total=0
[2015-02-14 06:32:13.712021] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2015-02-14 06:32:13.731311] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=588 max=0 total=0
[2015-02-14 06:32:13.731326] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2015-02-14 06:32:13.731356] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /exp/br02/brick2 on port 49153
[2015-02-14 06:32:13.823129] I [socket.c:2344:socket_event_handler] 0-transport: disconnecting now
[2015-02-14 06:32:13.840668] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/7565ec897c6454bd3e2f4800250a7221.socket failed (Invalid argument)
[2015-02-14 06:32:13.840693] I [MSGID: 106006] [glusterd-handler.c:4257:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd.
[2015-02-14 06:32:13.840712] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/ac4c043d3c6a2e5159c86e8c75c51829.socket failed (Invalid argument)
[2015-02-14 06:32:13.840728] I [MSGID: 106006] [glusterd-handler.c:4257:__glusterd_nodesvc_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2015-02-14 06:32:14.729667] E [glusterd-rpc-ops.c:1169:__glusterd_commit_op_cbk] 0-management: Received commit RJT from uuid: 294aa603-ec24-44b9-864b-0fe743faa8d9
[2015-02-14 06:32:14.743623] E [glusterd-rpc-ops.c:1169:__glusterd_commit_op_cbk] 0-management: Received commit RJT from uuid: 92aabaf4-4b6c-48da-82b6-c465aff2ec6d
[2015-02-14 06:32:18.762975] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
[2015-02-14 06:32:18.764552] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1
[2015-02-14 06:32:18.769051] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
[2015-02-14 06:32:18.769070] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2015-02-14 06:32:18.771095] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
[2015-02-14 06:32:18.771108] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2015-02-14 06:32:48.570796] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
[2015-02-14 06:32:48.572352] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1
[2015-02-14 06:32:48.576899] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
[2015-02-14 06:32:48.576918] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2015-02-14 06:32:48.578982] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
[2015-02-14 06:32:48.579001] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2015-02-14 06:36:57.840738] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
[2015-02-14 06:36:57.842370] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1
[2015-02-14 06:36:57.846919] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
[2015-02-14 06:36:57.846941] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2015-02-14 06:36:57.849026] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
[2015-02-14 06:36:57.849046] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2015-02-14 06:37:20.208081] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
[2015-02-14 06:37:20.211279] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1
[2015-02-14 06:37:20.215792] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
[2015-02-14 06:37:20.215809] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2015-02-14 06:37:20.216295] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
[2015-02-14 06:37:20.216308] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
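
To make a log like the one above easier to scan, the warning and error lines can be grouped by the source location that emitted them. This is just a sketch, assuming the line layout shown above (the severity letter is the third whitespace-separated field, followed by a [file.c:line:function] tag); summarize_log is a hypothetical helper, and the log file path is whatever glusterd writes to on your system:

```shell
#!/bin/sh
# Hypothetical helper: count W and E lines per [file.c:line:function]
# tag in a glusterd log, most frequent first.
summarize_log() {
    log_file=$1
    # Keep only warnings and errors (field 3 is the severity letter),
    # then extract the source-location tag. The timestamp and MSGID
    # brackets contain spaces, so the pattern does not match them.
    awk '$3 == "E" || $3 == "W"' "$log_file" \
        | grep -o '\[[^][ ]*\.c:[0-9]*:[^][ ]*]' \
        | sort | uniq -c | sort -rn
}

# Example:
#   summarize_log /path/to/glusterd.log
```

On the excerpt above this would collapse the repeated "Local tasks count (0) and remote tasks count (1) do not match" pairs into a couple of lines, which makes the two "Received commit RJT" errors at 06:32:14 stand out.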