[Gluster-users] Geo-replication broken in 3.4 alpha2?
samppah at neutraali.net
Wed Mar 20 18:58:54 UTC 2013
I'm running GlusterFS 3.4 alpha2 together with oVirt 3.2. This is solely a test system and it doesn't have much data or anything important on it. Currently it has only 2 VMs running and disk usage is around 15 GB. I have been trying to set up geo-replication for disaster recovery testing. For geo-replication I did the following:
All machines are running CentOS 6.4 and using GlusterFS packages from http://download.gluster.org/pub/gluster/glusterfs/qa-releases/3.4.0alpha2/EPEL.repo/. Gluster bricks are using XFS. On the slave I have tried ext4 and btrfs.
1. Installed the slave machine (a VM hosted in a separate environment) with glusterfs-geo-replication, rsync, and whatever other packages the dependencies pulled in.
2. Installed glusterfs-geo-replication and rsync packages on GlusterFS server.
3. Created an ssh key on the server, saved it to /var/lib/glusterd/geo-replication/secret.pem, and copied the public key into /root/.ssh/authorized_keys on the slave.
4. On server ran:
- gluster volume geo-replication vmstorage slave:/backup/vmstorage config remote_gsyncd /usr/libexec/glusterfs/gsyncd
- gluster volume geo-replication vmstorage slave:/backup/vmstorage start
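For reference, the steps above can be sketched as shell commands (the hostname `slave` and the paths are from my setup; adjust as needed):

```
# on the master: create a passphrase-less key for the geo-rep session
ssh-keygen -f /var/lib/glusterd/geo-replication/secret.pem -N ''

# install the public key on the slave's root account
ssh-copy-id -i /var/lib/glusterd/geo-replication/secret.pem.pub root@slave

# point the session at the slave-side gsyncd binary and start it
gluster volume geo-replication vmstorage slave:/backup/vmstorage \
    config remote_gsyncd /usr/libexec/glusterfs/gsyncd
gluster volume geo-replication vmstorage slave:/backup/vmstorage start
```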
After that, the geo-replication status showed "starting…" for a while and then switched to "N/A". I set the log level to DEBUG and saw lines like these appearing every 10 seconds:
[2013-03-20 18:48:19.417107] D [repce:175:push] RepceClient: call 27756:140178941277952:1363798099.42 keep_alive(None,) ...
[2013-03-20 18:48:19.418431] D [repce:190:__call__] RepceClient: call 27756:140178941277952:1363798099.42 keep_alive -> 34
[2013-03-20 18:48:29.427959] D [repce:175:push] RepceClient: call 27756:140178941277952:1363798109.43 keep_alive(None,) ...
[2013-03-20 18:48:29.429172] D [repce:190:__call__] RepceClient: call 27756:140178941277952:1363798109.43 keep_alive -> 35
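In case it helps anyone reproduce this: raising the log level goes through the same config interface, and the master-side logs end up under /var/log/glusterfs/geo-replication/ by default (at least on my CentOS install):

```
gluster volume geo-replication vmstorage slave:/backup/vmstorage \
    config log-level DEBUG

# watch the session log on the master
tail -f /var/log/glusterfs/geo-replication/vmstorage/*.log
```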
I thought that maybe it was creating an index or something like that, so I let it run for about 30 hours. Still, after that there were no new log messages and no data was being transferred to the slave. I tried using strace -p 27756 to see what was going on, but there was no output at all. My next thought was that maybe the running virtual machines were causing some trouble, so I shut down all VMs and restarted geo-replication, but it didn't have any effect. My last effort was to create a new, clean volume without any data in it and try geo-replication with it - no luck there either.
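For completeness, this is roughly how I checked the worker process (the PID 27756 above came from the log lines; strace's -f flag also catches any children gsyncd forks):

```
# find the gsyncd worker PID on the master
ps aux | grep '[g]syncd'

# trace the process and anything it forks; substitute the PID found above
strace -f -p 27756
```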
I also did a quick test with the master running GlusterFS 3.3.1, and it had no problems copying data to exactly the same slave server.
There isn't much documentation available about geo-replication, and before filing a bug report I'd like to hear whether anyone else has used geo-replication successfully with the 3.4 alphas, or if I'm missing something obvious.
Output of gluster volume info:
Volume Name: vmstorage
Volume ID: a800e5b7-089e-4b55-9515-c9cc72502aea
Number of Bricks: 2 x 2 = 4