[Gluster-users] GlusterFS geo-replication progress question
Alexander Iliev
ailiev+gluster at mamul.org
Thu Apr 2 00:08:36 UTC 2020
Hi all,
I have a running geo-replication session between two clusters and I'm
trying to figure out the current progress of the replication and, if
possible, how much longer it will take.
It has been running for quite a while now (> 1 month), but both the
hardware of the nodes and the link between the two clusters aren't that
great (e.g., the volumes are backed by rotating disks), and the volume
is fairly large (roughly 30 TB), so I'm not really sure how long it is
supposed to take under normal conditions.
Several bricks in the volume (same brick size and physical layout in
both clusters) are now showing a Changelog Crawl status and a recent
LAST_SYNCED date in the `gluster volume geo-replication status detail`
command output, which seems to be the desired state for all bricks. The
remaining bricks, though, are in Hybrid Crawl state and have been stuck
there for as long as I can tell.
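For what it's worth, a quick way to tally bricks per crawl state from the
`status detail` output could look like the sketch below. The sample lines
are made up for illustration (real output has more columns, e.g. ENTRY,
DATA, META, FAILURES), but the crawl-state strings themselves ("Changelog
Crawl", "Hybrid Crawl", "History Crawl") are the ones gluster reports.

```python
from collections import Counter

# Crawl states gluster geo-replication reports per brick.
CRAWL_STATES = ("Changelog Crawl", "Hybrid Crawl", "History Crawl")

def crawl_summary(status_output: str) -> Counter:
    """Count how many bricks are in each crawl state."""
    counts = Counter()
    for line in status_output.splitlines():
        for state in CRAWL_STATES:
            if state in line:
                counts[state] += 1
    return counts

# Illustrative (fake) excerpt of `status detail` output:
sample = """\
node1 vol1 /data/brick1 root x.x.x.x::vol1 slave1 Active Changelog Crawl 2020-03-31 10:00:01
node2 vol1 /data/brick2 root x.x.x.x::vol1 slave2 Active Hybrid Crawl N/A
node3 vol1 /data/brick3 root x.x.x.x::vol1 slave3 Active Hybrid Crawl N/A
"""
print(crawl_summary(sample))
```

Bricks that have finished the initial sync move to Changelog Crawl, so a
shrinking Hybrid Crawl count is at least a crude progress indicator.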
So I suppose my questions are: how can I tell whether the replication
session is somehow broken, and if it's not, is there a way for me to
find out the progress and the ETA of the replication?
In /var/log/glusterfs/geo-replication/$session_dir/gsyncd.log there are
some errors like:
[2020-03-31 11:48:47.81269] E [syncdutils(worker
/data/gfs/store1/8/brick):822:errlog] Popen: command returned error
cmd=ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i
/var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto
-S /tmp/gsyncd-aux-ssh-6aDWmc/206c4b2c3eb782ea2cf49ab5142bd68b.sock
x.x.x.x
/nonexistent/gsyncd slave <vol> x.x.x.x::<vol> --master-node x.x.x.x
--master-node-id 9476b8bb-d7ee-489a-b083-875805343e67 --master-brick
<brick_path> --local-node x.x.x.x
2 --local-node-id 426b564d-35d9-4291-980e-795903e9a386 --slave-timeout
120 --slave-log-level INFO --slave-gluster-log-level INFO
--slave-gluster-command-dir /usr/sbin error=1
[2020-03-31 11:48:47.81617] E [syncdutils(worker
<brick_path>):826:logerr] Popen: ssh> failed with ValueError.
[2020-03-31 11:48:47.390397] I [repce(agent
<brick_path>):97:service_loop] RepceServer: terminating on reaching EOF.
In the brick logs I see entries like:
[2020-03-29 07:49:05.338947] E [fuse-bridge.c:4167:fuse_xattr_cbk]
0-glusterfs-fuse: extended attribute not supported by the backend storage
I don't know whether these errors are critical; from the rest of the
logs it looks like data is still traveling between the clusters.
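Regarding the "extended attribute not supported by the backend storage"
message: a quick sanity check on a brick node could be to probe whether
the backend filesystem accepts extended attributes at all. The sketch
below only tests user.* xattrs (gluster itself uses trusted.* xattrs,
which require root), so it is an approximation, and the path is a
placeholder.

```python
import os
import tempfile

def supports_user_xattrs(directory: str) -> bool:
    """Try to set and read back a user.* xattr on a temp file in `directory`."""
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        try:
            os.setxattr(f.name, "user.xattr_probe", b"1")
            return os.getxattr(f.name, "user.xattr_probe") == b"1"
        except OSError:
            # EOPNOTSUPP and friends: backend does not support xattrs
            return False

# e.g. run against the filesystem backing a brick (placeholder path):
print(supports_user_xattrs("/tmp"))
```

If this returns False on the filesystem backing a brick, that would point
at a real backend problem rather than log noise.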
Any help will be greatly appreciated. Thank you in advance!
Best regards,
--
alexander iliev