[Gluster-users] Problem when rebooting geo-replication slave

Tue Jan 21 14:20:15 UTC 2014

Hi list,

I have a problem whenever a geo-replication slave has to be rebooted.
After the reboot the slave is out of sync and the gluster daemon fails
to start at all.
I have a workaround procedure that seems to work, but I suspect I am
doing something wrong or missing something.

I am currently using gluster 3.4.0 with the following setup.

Two replicating masters: fe and ni

One geo-replicating slave with periodic snapshots in zfs: nitinol

From master fe I have successfully set up geo-replication with:
gluster volume geo-replication gvarchive nitinol:/zfspool/gluster/gvarchive start

All is fine... not really...

When slave nitinol is rebooted it becomes broken.

service glusterfs-server start  # fails: the daemon does not start and logs the following:

[2014-01-21 13:14:53.352007] I [glusterfsd.c:1910:main] 
0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.4.2 
(/usr/sbin/glusterd -p /var/run/glusterd.pid)
[2014-01-21 13:14:53.354316] I [glusterd.c:961:init] 0-management: Using 
/var/lib/glusterd as working directory
[2014-01-21 13:14:53.356431] I [socket.c:3480:socket_init] 
0-socket.management: SSL support is NOT enabled
[2014-01-21 13:14:53.356490] I [socket.c:3495:socket_init] 
0-socket.management: using system polling thread
[2014-01-21 13:14:53.357999] W [rdma.c:4197:__gf_rdma_ctx_create] 
0-rpc-transport/rdma: rdma_cm event channel creation failed (No such device)
[2014-01-21 13:14:53.358055] E [rdma.c:4485:init] 0-rdma.management: 
Failed to initialize IB Device
[2014-01-21 13:14:53.358136] E [rpc-transport.c:320:rpc_transport_load] 
0-rpc-transport: 'rdma' initialization failed
[2014-01-21 13:14:53.358185] W [rpcsvc.c:1389:rpcsvc_transport_create] 
0-rpc-service: cannot create listener, initing the transport failed
[2014-01-21 13:14:55.083839] I 
[glusterd-store.c:1339:glusterd_restore_op_version] 0-glusterd: 
retrieved op-version: 2
[2014-01-21 13:14:55.092907] E 
[glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: 
brick-0
[2014-01-21 13:14:55.093002] E 
[glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: 
brick-1
.....
[2014-01-21 13:14:55.741895] E 
[glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: 
brick-0
[2014-01-21 13:14:55.741989] E 
[glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: 
brick-1
[2014-01-21 13:14:55.792063] I 
[glusterd-handler.c:2818:glusterd_friend_add] 0-management: connect 
returned 0
[2014-01-21 13:14:55.792258] I [rpc-clnt.c:962:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600
[2014-01-21 13:14:55.792416] I [socket.c:3480:socket_init] 0-management: 
SSL support is NOT enabled
[2014-01-21 13:14:55.792443] I [socket.c:3495:socket_init] 0-management: 
using system polling thread
[2014-01-21 13:14:55.796485] E 
[glusterd-store.c:2487:glusterd_resolve_all_bricks] 0-glusterd: resolve 
brick failed in restore
[2014-01-21 13:14:55.796546] E [xlator.c:390:xlator_init] 0-management: 
Initialization of volume 'management' failed, review your volfile again
[2014-01-21 13:14:55.796574] E [graph.c:292:glusterfs_graph_init] 
0-management: initializing translator failed
[2014-01-21 13:14:55.796596] E [graph.c:479:glusterfs_graph_activate] 
0-graph: init failed
[2014-01-21 13:14:55.797136] W [glusterfsd.c:1002:cleanup_and_exit] 
(-->/usr/sbin/glusterd(main+0x3cd) [0x7f737c1fb85d] 
(-->/usr/sbin/glusterd(glusterfs_volumes_init+0xc0) [0x7f737c1fe650] 
(-->/usr/sbin/glusterd(glusterfs_process_volfp+0x103) 
[0x7f737c1fe553]))) 0-: received signum (0), shutting down
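
To pick the relevant entries out of a log like the one above, a small filter that keeps only the error-level (E) lines can help. This is a minimal sketch, not part of gluster itself: the function name gluster_errors is made up for illustration, the default log path is an assumption about a typical install, and it assumes each entry sits on a single line in the real log file (the lines above were wrapped by the mail client).

```shell
# Print only error-level (E) entries from a glusterd log.
# Assumption: every entry starts with "[date time] E " on one line.
gluster_errors() {
    # $1: path to the log file. The default below is an assumption;
    # adjust it to wherever glusterd logs on your system.
    log_file="${1:-/var/log/glusterfs/etc-glusterfs-glusterd.vol.log}"
    grep -E '^\[[0-9-]+ [0-9:.]+\] E ' "$log_file"
}
```

Run on the slave as "gluster_errors /path/to/glusterd.log". On the excerpt above it would keep, among others, the "Unknown key: brick-0" and "resolve brick failed in restore" entries and drop the I and W lines.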

I have successfully corrected the situation by the following procedure:

# on slave:
rm -rf /var/lib/glusterd/vols

# on master
gluster volume geo-replication gvarchive nitinol:/zfspool/gluster/gvarchive stop
gluster peer detach nitinol

# on slave:
service glusterfs-server start

# on master:
gluster peer probe nitinol
gluster volume geo-replication gvarchive nitinol:/zfspool/gluster/gvarchive start
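
Since the first step of this procedure deletes glusterd state outright, a more cautious variant would be to move the directory aside so it can be inspected or restored later. The sketch below only illustrates that idea; the helper name backup_and_clear and the ".bak.<timestamp>" suffix are made up here, and it has not been verified that glusterd behaves the same after a rename as after the rm -rf.

```shell
# Move a directory aside instead of deleting it, and print the new location.
# backup_and_clear is a hypothetical helper, not a gluster command.
backup_and_clear() {
    src="$1"
    # Refuse to continue if the directory does not exist.
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    # Timestamped name so repeated runs do not overwrite earlier backups.
    dest="${src}.bak.$(date +%Y%m%d%H%M%S)"
    mv "$src" "$dest" && echo "$dest"
}

# Usage on the slave, in place of the rm -rf step:
#   backup_and_clear /var/lib/glusterd/vols
```

Keeping the old vols directory also makes it possible to compare it against the freshly created one after the peer probe, which might show what went out of sync.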

This does not seem correct.
Why do the volumes get out of sync?

Regards

Hans Höök