[Gluster-users] Gluster replicate-brick issues (Distributed-Replica)

Sun Feb 15 18:54:15 UTC 2015

Of those missing files, are they maybe DHT link files? Those have mode 1000 (sticky bit only) and size 0.
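
One way to check, as a rough sketch: the function below lists files under a brick directory that match that signature. The function name and the path in the usage comment are only for illustration, and skipping `.glusterfs` is my assumption, made so the gfid hard links kept there are not counted a second time.

```shell
#!/bin/sh
# List files under a brick directory that look like DHT link files:
# regular files with mode exactly 1000 (sticky bit only) and size 0.
# The .glusterfs directory is pruned (assumption: it holds gfid hard
# links to the same files, so walking it would double-count them).
list_linkto_candidates() {
    brick_dir=$1
    find "$brick_dir" -path "$brick_dir/.glusterfs" -prune -o \
        -type f -perm 1000 -size 0 -print
}

# Example (path taken from the thread, run on the server that holds it):
# list_linkto_candidates /exp/br02/brick2 | wc -l
```

A candidate is a real link file if it also carries the trusted.glusterfs.dht.linkto xattr, which names the subvolume that holds the actual data. If most of the 265 "missing" files turn out to be link files, the file counts on the two bricks are not directly comparable.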

On February 14, 2015 12:58:12 AM PST, Thomas Holkenbrink <thomas.holkenbrink at fibercloud.com> wrote:
>We tried to migrate a brick from one server to another using the
>commands below, but the data is NOT being replicated and the brick
>no longer shows up.
>Gluster still appears to be working, but the bricks are not balanced.
>I also need to add another brick for Server3, which I don't want to do
>until Server1:Brick2 has been replicated.
>
>This is the command to create the Original Volume:
>[root@Server1 ~]# gluster volume create Storage1 replica 2 transport tcp Server1:/exp/br01/brick1 Server2:/exp/br01/brick1 Server1:/exp/br02/brick2 Server2:/exp/br02/brick2
>
>
>This is the current configuration BEFORE the migration. Server3 has
>been peer probed successfully, but nothing else has been done with it.
>[root@Server1 ~]# gluster --version
>glusterfs 3.6.2 built on Jan 22 2015 12:58:11
>
>[root@Server1 ~]# gluster volume status
>Status of volume: Storage1
>Gluster process                 Port    Online  Pid
>------------------------------------------------------------------------------
>Brick Server1:/exp/br01/brick1  49152   Y       2167
>Brick Server2:/exp/br01/brick1  49152   Y       2192
>Brick Server1:/exp/br02/brick2  49153   Y       2172   <--- this is the one that goes missing
>Brick Server2:/exp/br02/brick2  49153   Y       2193
>NFS Server on localhost         2049    Y       2181
>Self-heal Daemon on localhost   N/A     Y       2186
>NFS Server on Server2           2049    Y       2205
>Self-heal Daemon on Server2     N/A     Y       2210
>NFS Server on Server3           2049    Y       6015
>Self-heal Daemon on Server3     N/A     Y       6016
>
>Task Status of Volume Storage1
>------------------------------------------------------------------------------
>There are no active volume tasks
>[root@Server1 ~]# gluster volume info
>
>Volume Name: Storage1
>Type: Distributed-Replicate
>Volume ID: 9616ce42-48bd-4fe3-883f-decd6c4fcd00
>Status: Started
>Number of Bricks: 2 x 2 = 4
>Transport-type: tcp
>Bricks:
>Brick1: Server1:/exp/br01/brick1
>Brick2: Server2:/exp/br01/brick1
>Brick3: Server1:/exp/br02/brick2
>Brick4: Server2:/exp/br02/brick2
>Options Reconfigured:
>diagnostics.brick-log-level: WARNING
>diagnostics.client-log-level: WARNING
>cluster.entry-self-heal: off
>cluster.data-self-heal: off
>cluster.metadata-self-heal: off
>performance.cache-size: 1024MB
>performance.cache-max-file-size: 2MB
>performance.cache-refresh-timeout: 1
>performance.stat-prefetch: off
>performance.read-ahead: on
>performance.quick-read: off
>performance.write-behind-window-size: 4MB
>performance.flush-behind: on
>performance.write-behind: on
>performance.io-thread-count: 32
>performance.io-cache: on
>network.ping-timeout: 2
>nfs.addr-namelookup: off
>performance.strict-write-ordering: on
>[root@Server1 ~]#
>
>
>
>So we start the migration of the brick to the new server using the
>replace-brick command:
>[root@Server1 ~]# volname=Storage1
>
>[root@Server1 ~]# from=Server1:/exp/br02/brick2
>
>[root@Server1 ~]# to=Server3:/exp/br02/brick2
>
>[root@Server1 ~]# gluster volume replace-brick $volname $from $to start
>All replace-brick commands except commit force are deprecated. Do you
>want to continue? (y/n) y
>volume replace-brick: success: replace-brick started successfully
>ID: 0062d555-e7eb-4ebe-a264-7e0baf6e7546
>
>
>[root@Server1 ~]# gluster volume replace-brick $volname $from $to status
>All replace-brick commands except commit force are deprecated. Do you
>want to continue? (y/n) y
>volume replace-brick: success: Number of files migrated = 281  
>Migration complete
>
>At this point everything seems to be in order with no outstanding
>issues.
>
>[root@Server1 ~]# gluster volume status
>Status of volume: Storage1
>Gluster process                 Port    Online  Pid
>------------------------------------------------------------------------------
>Brick Server1:/exp/br01/brick1  49152   Y       2167
>Brick Server2:/exp/br01/brick1  49152   Y       2192
>Brick Server1:/exp/br02/brick2  49153   Y       27557
>Brick Server2:/exp/br02/brick2  49153   Y       2193
>NFS Server on localhost         2049    Y       27562
>Self-heal Daemon on localhost   N/A     Y       2186
>NFS Server on Server2           2049    Y       2205
>Self-heal Daemon on Server2     N/A     Y       2210
>NFS Server on Server3           2049    Y       6015
>Self-heal Daemon on Server3     N/A     Y       6016
>
>Task Status of Volume Storage1
>------------------------------------------------------------------------------
>Task                 : Replace brick
>ID                   : 0062d555-e7eb-4ebe-a264-7e0baf6e7546
>Source Brick         : Server1:/exp/br02/brick2
>Destination Brick    : Server3:/exp/br02/brick2
>Status               : completed
>
>The volume reports that the replace-brick operation completed, so the
>next step is to commit the change:
>
>[root@Server1 ~]# gluster volume replace-brick $volname $from $to commit
>All replace-brick commands except commit force are deprecated. Do you
>want to continue? (y/n) y
>volume replace-brick: success: replace-brick commit successful
>
>At this point, when I look at the status, the OLD brick
>(Server1:/exp/br02/brick2) is missing AND I don't see the new
>brick... WTF... panic!
>
>[root@Server1 ~]# gluster volume status
>Status of volume: Storage1
>Gluster process                 Port    Online  Pid
>------------------------------------------------------------------------------
>Brick Server1:/exp/br01/brick1  49152   Y       2167
>Brick Server2:/exp/br01/brick1  49152   Y       2192
>Brick Server2:/exp/br02/brick2  49153   Y       2193
>NFS Server on localhost         2049    Y       28906
>Self-heal Daemon on localhost   N/A     Y       28911
>NFS Server on Server2           2049    Y       2205
>Self-heal Daemon on Server2     N/A     Y       2210
>NFS Server on Server3           2049    Y       6015
>Self-heal Daemon on Server3     N/A     Y       6016
>
>Task Status of Volume Storage1
>------------------------------------------------------------------------------
>There are no active volume tasks
>
>
>After the commit, Server1 no longer lists the task, yet Server2 and
>Server3 still show this:
>
>[root@Server2 ~]# gluster volume status
>Status of volume: Storage1
>Gluster process                 Port    Online  Pid
>------------------------------------------------------------------------------
>Brick Server1:/exp/br01/brick1  49152   Y       2167
>Brick Server2:/exp/br01/brick1  49152   Y       2192
>Brick Server2:/exp/br02/brick2  49153   Y       2193
>NFS Server on localhost         2049    Y       2205
>Self-heal Daemon on localhost   N/A     Y       2210
>NFS Server on 10.45.16.17       2049    Y       28906
>Self-heal Daemon on 10.45.16.17 N/A     Y       28911
>NFS Server on server3           2049    Y       6015
>Self-heal Daemon on server3     N/A     Y       6016
>
>Task Status of Volume Storage1
>------------------------------------------------------------------------------
>Task                 : Replace brick
>ID                   : 0062d555-e7eb-4ebe-a264-7e0baf6e7546
>Source Brick         : Server1:/exp/br02/brick2
>Destination Brick    : server3:/exp/br02/brick2
>Status               : completed
>
>
>If I browse the brick on Server3, it is NOT empty, but it is missing
>A LOT! It's as if the replace-brick stopped and never restarted.
>The replace-brick reported "Number of files migrated = 281
>Migration complete", but when I look at the brick on Server3 I get:
>       [root@Server3 brick2]# find . -type f -print | wc -l
>16
>
>I'm missing 265 files (they still exist on the OLD brick, but how
>can I move them?)
>
>If I try to add the old brick back, paired with another brick on the
>new server, like this:
>[root@Server1 ~]# gluster volume add-brick Storage1 Server1:/exp/br02/brick2 Server3:/exp/br01/brick1
>volume add-brick: failed: /exp/br02/brick2 is already part of a volume
>
>I'm fearful of running:
>[root@Server1 ~]# setfattr -n trusted.glusterfs.volume-id -v 0x$(grep volume-id /var/lib/glusterd/vols/$volname/info | cut -d= -f2 | sed 's/-//g') /exp/br02/brick2
>although it should allow me to add the brick.
>
>Gluster heal info returns:
>[root@Server2 ~]# gluster volume heal Storage1 info
>Brick Server1:/exp/br01/brick1/
>Number of entries: 0
>
>Brick Server2:/exp/br01/brick1/
>Number of entries: 0
>
>Brick Server1:/exp/br02/brick2
>Status: Transport endpoint is not connected
>
>Brick Server2:/exp/br02/brick2/
>Number of entries: 0
>
>I have restarted glusterd numerous times.
>
>
>At this point I'm not sure where to go from here. I know that
>Server1:/exp/br02/brick2 still has all the data, and
>Server3:/exp/br02/brick2 is not complete.
>
>How do I actually get the brick to replicate?
>How can I add Server1:/exp/br02/brick2 back into the volume if I
>can't replicate it or re-add it?
>How can I fix this to get back to a replicated state across the
>three servers?
>
>Thomas
>
>
>
>
>----DATA----
>
>Gluster volume info at this point
>[root@Server1 ~]# gluster volume info
>
>Volume Name: Storage1
>Type: Distributed-Replicate
>Volume ID: 9616ce42-48bd-4fe3-883f-decd6c4fcd00
>Status: Started
>Number of Bricks: 2 x 2 = 4
>Transport-type: tcp
>Bricks:
>Brick1: Server1:/exp/br01/brick1
>Brick2: Server2:/exp/br01/brick1
>Brick3: server3:/exp/br02/brick2
>Brick4: Server2:/exp/br02/brick2
>Options Reconfigured:
>diagnostics.brick-log-level: WARNING
>diagnostics.client-log-level: WARNING
>cluster.entry-self-heal: off
>cluster.data-self-heal: off
>cluster.metadata-self-heal: off
>performance.cache-size: 1024MB
>performance.cache-max-file-size: 2MB
>performance.cache-refresh-timeout: 1
>performance.stat-prefetch: off
>performance.read-ahead: on
>performance.quick-read: off
>performance.write-behind-window-size: 4MB
>performance.flush-behind: on
>performance.write-behind: on
>performance.io-thread-count: 32
>performance.io-cache: on
>network.ping-timeout: 2
>nfs.addr-namelookup: off
>performance.strict-write-ordering: on
>[root@Server1 ~]#
>
>[root@server3 brick2]# gluster volume heal Storage1 info
>Brick Server1:/exp/br01/brick1/
>Number of entries: 0
>
>Brick Server2:/exp/br01/brick1/
>Number of entries: 0
>
>Brick Server3:/exp/br02/brick2/
>Number of entries: 0
>
>Brick Server2:/exp/br02/brick2/
>Number of entries: 0
>
>
>Gluster log (there are a few errors, but I'm not sure how to decipher
>them):
>
>[2015-02-14 06:29:19.862809] I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick Server1:/exp/br02/brick2 has disconnected from glusterd.
>[2015-02-14 06:29:19.862836] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/7565ec897c6454bd3e2f4800250a7221.socket failed (Invalid argument)
>[2015-02-14 06:29:19.862853] I [MSGID: 106006] [glusterd-handler.c:4257:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd.
>[2015-02-14 06:29:19.953762] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /exp/br02/brick2 on port 49153
>[2015-02-14 06:31:12.977450] I [glusterd-replace-brick.c:99:__glusterd_handle_replace_brick] 0-management: Received replace brick req
>[2015-02-14 06:31:12.977495] I [glusterd-replace-brick.c:154:__glusterd_handle_replace_brick] 0-management: Received replace brick status request
>[2015-02-14 06:31:13.048852] I [glusterd-replace-brick.c:1412:rb_update_srcbrick_port] 0-: adding src-brick port no
>[2015-02-14 06:31:19.588380] I [glusterd-replace-brick.c:99:__glusterd_handle_replace_brick] 0-management: Received replace brick req
>[2015-02-14 06:31:19.588422] I [glusterd-replace-brick.c:154:__glusterd_handle_replace_brick] 0-management: Received replace brick status request
>[2015-02-14 06:31:19.661101] I [glusterd-replace-brick.c:1412:rb_update_srcbrick_port] 0-: adding src-brick port no
>[2015-02-14 06:31:45.115355] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
>[2015-02-14 06:31:45.118597] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1
>[2015-02-14 06:32:10.956357] I [glusterd-replace-brick.c:99:__glusterd_handle_replace_brick] 0-management: Received replace brick req
>[2015-02-14 06:32:10.956385] I [glusterd-replace-brick.c:154:__glusterd_handle_replace_brick] 0-management: Received replace brick commit request
>[2015-02-14 06:32:11.028472] I [glusterd-replace-brick.c:1412:rb_update_srcbrick_port] 0-: adding src-brick port no
>[2015-02-14 06:32:12.122552] I [glusterd-utils.c:6276:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV3 successfully
>[2015-02-14 06:32:12.131836] I [glusterd-utils.c:6281:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV1 successfully
>[2015-02-14 06:32:12.141107] I [glusterd-utils.c:6286:glusterd_nfs_pmap_deregister] 0-: De-registered NFSV3 successfully
>[2015-02-14 06:32:12.150375] I [glusterd-utils.c:6291:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v4 successfully
>[2015-02-14 06:32:12.159630] I [glusterd-utils.c:6296:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v1 successfully
>[2015-02-14 06:32:12.168889] I [glusterd-utils.c:6301:glusterd_nfs_pmap_deregister] 0-: De-registered ACL v3 successfully
>[2015-02-14 06:32:13.254689] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
>[2015-02-14 06:32:13.254799] W [socket.c:2992:socket_connect] 0-management: Ignore failed connection attempt on , (No such file or directory)
>[2015-02-14 06:32:13.257790] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
>[2015-02-14 06:32:13.257908] W [socket.c:2992:socket_connect] 0-management: Ignore failed connection attempt on , (No such file or directory)
>[2015-02-14 06:32:13.258031] W [socket.c:611:__socket_rwv] 0-socket.management: writev on 127.0.0.1:1019 failed (Broken pipe)
>[2015-02-14 06:32:13.258111] W [socket.c:611:__socket_rwv] 0-socket.management: writev on 127.0.0.1:1021 failed (Broken pipe)
>[2015-02-14 06:32:13.258130] W [socket.c:611:__socket_rwv] 0-socket.management: writev on 10.45.16.17:1018 failed (Broken pipe)
>[2015-02-14 06:32:13.711948] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=588 max=0 total=0
>[2015-02-14 06:32:13.711967] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=124 max=0 total=0
>[2015-02-14 06:32:13.712008] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=588 max=0 total=0
>[2015-02-14 06:32:13.712021] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=124 max=0 total=0
>[2015-02-14 06:32:13.731311] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=588 max=0 total=0
>[2015-02-14 06:32:13.731326] I [mem-pool.c:545:mem_pool_destroy] 0-management: size=124 max=0 total=0
>[2015-02-14 06:32:13.731356] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /exp/br02/brick2 on port 49153
>[2015-02-14 06:32:13.823129] I [socket.c:2344:socket_event_handler] 0-transport: disconnecting now
>[2015-02-14 06:32:13.840668] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/7565ec897c6454bd3e2f4800250a7221.socket failed (Invalid argument)
>[2015-02-14 06:32:13.840693] I [MSGID: 106006] [glusterd-handler.c:4257:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd.
>[2015-02-14 06:32:13.840712] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/ac4c043d3c6a2e5159c86e8c75c51829.socket failed (Invalid argument)
>[2015-02-14 06:32:13.840728] I [MSGID: 106006] [glusterd-handler.c:4257:__glusterd_nodesvc_rpc_notify] 0-management: glustershd has disconnected from glusterd.
>[2015-02-14 06:32:14.729667] E [glusterd-rpc-ops.c:1169:__glusterd_commit_op_cbk] 0-management: Received commit RJT from uuid: 294aa603-ec24-44b9-864b-0fe743faa8d9
>[2015-02-14 06:32:14.743623] E [glusterd-rpc-ops.c:1169:__glusterd_commit_op_cbk] 0-management: Received commit RJT from uuid: 92aabaf4-4b6c-48da-82b6-c465aff2ec6d
>[2015-02-14 06:32:18.762975] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
>[2015-02-14 06:32:18.764552] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1
>[2015-02-14 06:32:18.769051] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
>[2015-02-14 06:32:18.769070] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
>[2015-02-14 06:32:18.771095] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
>[2015-02-14 06:32:18.771108] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
>[2015-02-14 06:32:48.570796] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
>[2015-02-14 06:32:48.572352] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1
>[2015-02-14 06:32:48.576899] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
>[2015-02-14 06:32:48.576918] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
>[2015-02-14 06:32:48.578982] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
>[2015-02-14 06:32:48.579001] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
>[2015-02-14 06:36:57.840738] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
>[2015-02-14 06:36:57.842370] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1
>[2015-02-14 06:36:57.846919] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
>[2015-02-14 06:36:57.846941] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
>[2015-02-14 06:36:57.849026] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
>[2015-02-14 06:36:57.849046] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
>[2015-02-14 06:37:20.208081] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
>[2015-02-14 06:37:20.211279] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Storage1
>[2015-02-14 06:37:20.215792] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
>[2015-02-14 06:37:20.215809] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
>[2015-02-14 06:37:20.216295] E [glusterd-utils.c:9955:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (0) and remote tasks count (1) do not match. Not aggregating tasks status.
>[2015-02-14 06:37:20.216308] E [glusterd-syncop.c:961:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
>
>
