[Bugs] [Bug 1764015] geo-rep syncing significantly behind and also only one of the directories is synced, with tracebacks seen

bugzilla@redhat.com
Tue Oct 22 07:34:55 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1764015



--- Comment #1 from Kotresh HR <khiremat@redhat.com> ---
Description of problem:
=======================
I had set up a geo-replication session between a 4x3 master volume and a
4x(4+2) EC slave volume. I see the following issues in my test bed:
1) The volume has two main directories, IOs and logs. The IOs directory is
where all the workload-related I/O happens, while the logs directory hosts a
dedicated file for each client, collecting resource output every few minutes
in append mode. The problem is that even now, i.e. after about 3 days, the
logs directory hasn't been created on the slave.
2) Syncing has been very slow paced; even after 3 days the slave is yet to
catch up. The master has about 1.1 TB of data while the slave has only about
350 GB.
3) I have seen tracebacks in the gsyncd log, as below (a minimal sketch of
the failing pattern follows the second excerpt):

/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714

[2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...



[root@rhs-gp-srv7 glusterfs]# less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log


[2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster-brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888
[2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF.
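
The AttributeError above means some value in the rsync argv construction in
resource.py is still None when .split() is called on it; the worker then dies
and the monitor restarts it, matching the Faulty/Initializing status changes.
Below is a minimal sketch of that failure pattern and a defensive guard.
Apart from rconf.ssh_ctl_args and the exception text, which come straight
from the traceback, every name here is a hypothetical stand-in, not gsyncd's
actual code:

# Sketch of the failure pattern from the traceback; names other than
# rconf.ssh_ctl_args are hypothetical, not gsyncd internals.

class RConf:
    """Stand-in for gsyncd's runtime configuration object."""
    def __init__(self):
        self.ssh_ctl_args = []      # extra ssh control args
        self.ssh_remote_cmd = None  # hypothetical option left unset

rconf = RConf()

def build_rsync_argv(rconf):
    # Crashes exactly like the log: calling .split() on a None config
    # value raises AttributeError: 'NoneType' object has no attribute 'split'
    return ["rsync"] + rconf.ssh_ctl_args + rconf.ssh_remote_cmd.split()

def build_rsync_argv_guarded(rconf):
    # Defensive variant: treat an unset option as "no extra args".
    extra = rconf.ssh_remote_cmd.split() if rconf.ssh_remote_cmd else []
    return ["rsync"] + (rconf.ssh_ctl_args or []) + extra

try:
    build_rsync_argv(rconf)
except AttributeError as e:
    print("worker would go Faulty here:", e)

print(build_rsync_argv_guarded(rconf))  # -> ['rsync']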


Version-Release number of selected component (if applicable):
=====================
6.0.7
rsync-3.1.2-6.el7_6.1.x86_64



Steps to Reproduce:
====================
Note: brick multiplexing is not enabled.
1. Created a 4x3 volume on 4 nodes with the volume settings below; this acts
as the master in geo-rep.
Volume Name: nonfuncvol
Type: Distributed-Replicate
Volume ID: 4d44936f-312d-431a-905d-813e8ee63668
Status: Started
Snapshot Count: 1
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick2: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick3: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick4: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv02
Brick5: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick6: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick7: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick8: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick9: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv03
Brick10: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick11: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick12: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.barrier: disable
cluster.shd-max-threads: 24
client.event-threads: 8
server.event-threads: 8
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable



2. Mounted the volume on 10 clients and started capturing resource info on
the clients (a sketch of this capture loop follows the steps).
3. Created another 3-node cluster to be used as the slave, with a 4x(4+2) EC
volume as the slave volume.
4. Started I/O on the master clients: just a Linux untar, 50 times from all
clients.
5. Set up geo-rep from master to slave.
6. Started geo-rep only after about 4 hours, so that the master had some data
to propagate.
7. Left the setup running over the weekend.
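
For context, the per-client resource capture in step 2 is just an append-mode
writer; a rough sketch follows. The mount point, file naming, interval, and
sampled command are assumptions, not the exact script used:

# Illustrative stand-in for the per-client resource capture (step 2):
# each client appends snapshots to its own file under the logs directory.
# Mount point, filename, interval, and command are assumptions.
import socket
import subprocess
import time

LOGFILE = "/mnt/nonfuncvol/logs/%s.log" % socket.gethostname()
INTERVAL = 120  # seconds; "every few minutes"

while True:
    with open(LOGFILE, "a") as f:  # append mode, as described above
        f.write("=== %s ===\n" % time.ctime())
        f.write(subprocess.check_output(["free", "-m"]).decode())
    time.sleep(INTERVAL)

Workloads like this produce frequent small appends to the same files, which
may be relevant given that it is exactly the logs directory that never
appeared on the slave.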



Actual results:
===================
Same three issues as described above: the logs directory was never created on
the slave, syncing is far behind (about 1.1 TB on the master vs. about 350 GB
on the slave), and the same rsync AttributeError tracebacks appear repeatedly
in gsyncd.log and gsyncd.log-20190714.
