[Bugs] [Bug 1764174] New: geo-rep syncing significantly behind and also only one of the directories are synced with tracebacks seen

bugzilla at redhat.com bugzilla at redhat.com
Tue Oct 22 12:28:30 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1764174

            Bug ID: 1764174
           Summary: geo-rep syncing significantly behind and also only one
                    of the directories are synced with tracebacks seen
           Product: GlusterFS
           Version: 6
            Status: NEW
         Component: geo-replication
          Keywords: Regression
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: khiremat at redhat.com
                CC: amukherj at redhat.com, avishwan at redhat.com,
                    bugs at gluster.org, csaba at redhat.com,
                    hgowtham at redhat.com, khiremat at redhat.com,
                    ksubrahm at redhat.com, moagrawa at redhat.com,
                    nchilaka at redhat.com, rgowdapp at redhat.com,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    sheggodu at redhat.com, spalai at redhat.com,
                    storage-qa-internal at redhat.com, sunkumar at redhat.com
        Depends On: 1729915, 1737484
            Blocks: 1764015
  Target Milestone: ---
    Classification: Community



++ This bug was initially created as a clone of Bug #1737484 +++

+++ This bug was initially created as a clone of Bug #1729915 +++

Description of problem:
=======================
Had set up a geo-rep session between a 4x3 master volume and a 4x(4+2) EC
slave volume.
I see the following issues in my test bed:
1) The volume has two main directories, IOs and logs. The IOs directory is
where all the workload-related I/O is happening, while the logs directory
hosts a dedicated file for each client, collecting resource output every few
minutes in append mode. The problem is that even now, i.e. after about 3
days, the logs directory has not even been created on the slave.
2) Syncing has been very slow; even after 3 days the slave has yet to catch
up. The master has about 1.1 TB of data while the slave has only about
350 GB.
3) I have seen some tracebacks in the gsyncd log, as below.

/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714

[2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...



[root at rhs-gp-srv7 glusterfs]# #less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log


[2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster-brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888
[2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF.
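
For context on the tracebacks above: the failing line in resource.py assembles
the rsync command line from rconf.ssh_ctl_args plus further option strings on
the continued line, and .split() is called on a value that was apparently
never populated (still None). The following is a minimal, hypothetical sketch
of that failure pattern; it is not the actual gsyncd code, and the names
ssh_options and build_rsync_args are illustrative only.

# Hypothetical sketch (not the actual gsyncd resource.py code) of the
# "'NoneType' object has no attribute 'split'" pattern seen in the traceback:
# the rsync argument list is built from config attributes, one of which is
# still None when .split() is called on it.

class RunConf(object):
    ssh_ctl_args = ["-oControlMaster=auto"]   # illustrative value only
    ssh_options = None                        # hypothetical unset option string

rconf = RunConf()

def build_rsync_args(rconf):
    # Same shape as the failing expression: list + None.split() -> AttributeError
    return (["rsync", "-a", "--files-from=-"] +
            rconf.ssh_ctl_args +
            rconf.ssh_options.split())

def build_rsync_args_guarded(rconf):
    # Defensive variant: treat missing option values as empty
    return (["rsync", "-a", "--files-from=-"] +
            (rconf.ssh_ctl_args or []) +
            (rconf.ssh_options or "").split())

if __name__ == "__main__":
    try:
        build_rsync_args(rconf)
    except AttributeError as exc:
        print("FAIL:", exc)   # 'NoneType' object has no attribute 'split'
    print(build_rsync_args_guarded(rconf))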


Version-Release number of selected component (if applicable):
=====================
6.0.7
rsync-3.1.2-6.el7_6.1.x86_64



Steps to Reproduce:
====================
note: brick multiplexing is not enabled
1. Created a 4x3 volume on 4 nodes with the volume settings below; this acts
as the master in geo-rep.
Volume Name: nonfuncvol
Type: Distributed-Replicate
Volume ID: 4d44936f-312d-431a-905d-813e8ee63668
Status: Started
Snapshot Count: 1
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick2: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick3: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick4: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv02
Brick5: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick6: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick7: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick8: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick9: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv03
Brick10: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick11: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick12: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.barrier: disable
cluster.shd-max-threads: 24
client.event-threads: 8
server.event-threads: 8
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable



2. Mounted the volume on 10 clients and started capturing resource info on
the clients (see the sketch after this list).
3. Created another 3-node cluster to be used as the slave, with a 4x(4+2) EC
volume as the slave volume.
4. Started I/O on the master clients: a Linux untar, 50 times, from all
clients.
5. Set up geo-rep from master to slave.
6. Started geo-rep only after about 4 hours so that the master had some data
to propagate.
7. Left the setup over the weekend.
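
For reference, the per-client resource capture mentioned in step 2 is simply
a loop that appends a snapshot of resource usage to a dedicated per-client
file under the logs directory every few minutes. A minimal, hypothetical
sketch of that workload follows; the mount point, interval, and captured
command are assumptions, not the exact script used on the test bed.

# Hypothetical sketch of the per-client resource capture from step 2: each
# client appends a timestamped resource snapshot to its own file under the
# "logs" directory on the mounted volume every few minutes (append mode).

import os
import socket
import subprocess
import time

MOUNT = "/mnt/nonfuncvol"                        # assumed client mount point
LOGDIR = os.path.join(MOUNT, "logs")
LOGFILE = os.path.join(LOGDIR, socket.gethostname() + ".log")
INTERVAL = 300                                   # seconds between snapshots

os.makedirs(LOGDIR, exist_ok=True)               # first client creates logs/
while True:
    snapshot = subprocess.check_output(["top", "-b", "-n", "1"]).decode()
    with open(LOGFILE, "a") as f:                # append mode, as in the report
        f.write("=== %s ===\n%s\n" % (time.ctime(), snapshot))
    time.sleep(INTERVAL)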



Actual results:
===================
The following issues were seen:
1) The volume has two main directories, IOs and logs. The IOs directory is
where all the workload-related I/O is happening, while the logs directory
hosts a dedicated file for each client, collecting resource output every few
minutes in append mode. The problem is that even now, i.e. after about 3
days, the logs directory has not even been created on the slave.
2) Syncing has been very slow; even after 3 days the slave has yet to catch
up. The master has about 1.1 TB of data while the slave has only about
350 GB.
3) I have seen some tracebacks in the gsyncd log, as below.

/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714

[2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...



[root at rhs-gp-srv7 glusterfs]# #less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log


[2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster-brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888
[2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF.

--- Additional comment from RHEL Product and Program Management on 2019-07-15
10:18:53 UTC ---

This bug is automatically being proposed for the next minor release of Red Hat
Gluster Storage by setting the release flag 'rhgs-3.5.0' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from nchilaka on 2019-07-15 10:21:43 UTC ---

Proposing as a blocker, as syncing is falling significantly behind and
tracebacks are seen. Can revisit based on the RC from dev.


^C
[root at rhs-gp-srv5 bricks]#  date;gluster volume geo-replication nonfuncvol rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave status
Mon Jul 15 15:50:57 IST 2019

MASTER NODE                           MASTER VOL    MASTER BRICK                       SLAVE USER    SLAVE                                                    SLAVE NODE                             STATUS     CRAWL STATUS    LAST_SYNCED
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A
[root at rhs-gp-srv5 bricks]# 





slave volinfo


Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: 34c52663-f47b-42e5-a33c-abe5d16382a8
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv16.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick3: rhs-gp-srv13.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable

Volume Name: nonfuncvol-slave
Type: Distributed-Disperse
Volume ID: b5753c86-ea76-4e0e-8306-acc1d5237ced
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick3: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick4: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick5: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick6: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick7: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick8: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick9: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick10: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick11: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick12: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick13: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick14: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick15: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick16: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick17: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick18: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick19: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick20: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick21: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick22: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Brick23: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Brick24: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Options Reconfigured:
features.read-only: on
performance.quick-read: off
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
cluster.enable-shared-storage: enable

--- Additional comment from nchilaka on 2019-07-15 10:46:07 UTC ---

sosreports and logs @
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1729915/

--- Additional comment from nchilaka on 2019-07-15 11:37:59 UTC ---

Do let me know ASAP if you need the setup (I can wait till EOD); otherwise I
will go ahead with further testing, which may involve reconfiguring part or
all of the testbed.

--- Additional comment from Atin Mukherjee on 2019-07-15 12:21:25 UTC ---

Hari - since Sunny is on PTO for this week and this is proposed as a blocker,
can you please work on this bug? Please don't hesitate to contact
Kotresh/Aravinda should you need any help.

Nag - Please note that Sunny is on PTO, so we'd have to expect some delay in
picking this up; till then, please don't destroy the setup.

--- Additional comment from Atin Mukherjee on 2019-07-15 13:37:05 UTC ---

(In reply to Atin Mukherjee from comment #5)
> Hari - since Sunny is on PTO for this week and this is proposed as a
> blocker, can you please work on this bug? Please don't hesitate to contact
> Kotresh/Aravinda should you need any help.

I see that Hari is on PTO as well till 17th.

Aravinda - would you be able to assist here? Kotresh has a couple of bugs on
his plate which he's focusing on, hence the request for your help.

> 
> Nag - Please note that Sunny is on PTO, so we'd have to expect some delay in
> picking this up, till then don't destroy the setup.

--- Additional comment from Rochelle on 2019-07-16 05:30:24 UTC ---

I'm seeing this while running automation on the latest builds as well:

[2019-07-15 10:31:26.713311] E [syncdutils(worker /bricks/brick0/master_brick0):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'


There was no functionality impact in my case.

However, there were additional 'No such file or directory' messages in the
brick logs:
mnt-bricks-brick0-master_brick1.log-20190716:[2019-07-15 11:19:49.950747] E [fuse-bridge.c:220:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fac24305b3b] (-->


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1729915
[Bug 1729915] geo-rep syncing significantly behind and also only one of the
directories are synced with tracebacks seen
https://bugzilla.redhat.com/show_bug.cgi?id=1737484
[Bug 1737484] geo-rep syncing significantly behind and also only one of the
directories are synced with tracebacks seen
https://bugzilla.redhat.com/show_bug.cgi?id=1764015
[Bug 1764015] geo-rep syncing significantly behind and also only one of the
directories are synced with tracebacks seen
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

