[Bugs] [Bug 1499520] New: Geo-rep: Lot of changelog(rsync) retries and failure during directory syncing.

Sun Oct 8 11:23:09 UTC 2017

https://bugzilla.redhat.com/show_bug.cgi?id=1499520

            Bug ID: 1499520
           Summary: Geo-rep: Lot of changelog(rsync) retries and failure
                    during directory syncing.
           Product: GlusterFS
           Version: mainline
         Component: distribute
          Assignee: bugs at gluster.org
          Reporter: khiremat at redhat.com
                CC: bugs at gluster.org

Description of problem:
Directory syncing is causing changelog retries because of rsync errors on
directory gfids.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Always

Steps to Reproduce:
1. Setup geo-replication
2. Create directory on master
3. Let geo-rep sync to slave

Actual results:
Directory syncing is causing retries. Directory is synced but lot of retries. 
I think rsync on directory gfid is failing. (rsync /mastermnt/.gfid/<dirgfid>
/slavemnt/)

Expected results:
Directory syncing should not involve any retries. It would cause performance
degradation

Analysis:
I found out that the issue is introduced with the below patch.

commit 9b4de61a136b8e5ba7bf0e48690cdb1292d0dee8
cluster/dht : User xattrs are not healed after brick stop/start

Additional info:

Geo-replication log:

[2017-10-08 11:12:31.657124] I [master(/bricks/brick0/b0):1382:crawl] _GMaster:
slave's time    stime=(1507460923, 0)
[2017-10-08 11:12:31.867773] I [master(/bricks/brick0/b0):1860:syncjob] Syncer:
Sync Time Taken duration=0.1440 num_files=1     job=1   return_code=23
[2017-10-08 11:12:31.871272] W [master(/bricks/brick0/b0):1267:process]
_GMaster: incomplete sync, retrying changelogs  files=['CHANGELOG.1507461149']
[2017-10-08 11:12:32.875348] I [master(/bricks/brick0/b0):1860:syncjob] Syncer:
Sync Time Taken duration=0.1471 num_files=1     job=3   return_code=23
[2017-10-08 11:12:32.880616] W [master(/bricks/brick0/b0):1267:process]
_GMaster: incomplete sync, retrying changelogs  files=['CHANGELOG.1507461149']
[2017-10-08 11:12:33.907165] I [master(/bricks/brick0/b0):1860:syncjob] Syncer:
Sync Time Taken duration=0.1759 num_files=1     job=2   return_code=23
[2017-10-08 11:12:33.908428] W [master(/bricks/brick0/b0):1267:process]
_GMaster: incomplete sync, retrying changelogs  files=['CHANGELOG.1507461149']
[2017-10-08 11:12:35.39695] I [master(/bricks/brick0/b0):1860:syncjob] Syncer:
Sync Time Taken  duration=0.1532 num_files=1     job=1   return_code=23
[2017-10-08 11:12:35.42959] W [master(/bricks/brick0/b0):1267:process]
_GMaster: incomplete sync, retrying changelogs   files=['CHANGELOG.1507461149']
[2017-10-08 11:12:36.42484] I [master(/bricks/brick0/b0):1860:syncjob] Syncer:
Sync Time Taken  duration=0.1516 num_files=1     job=3   return_code=23
[2017-10-08 11:12:36.47372] W [master(/bricks/brick0/b0):1267:process]
_GMaster: incomplete sync, retrying changelogs   files=['CHANGELOG.1507461149']
[2017-10-08 11:12:37.95174] I [master(/bricks/brick0/b0):1860:syncjob] Syncer:
Sync Time Taken  duration=0.1764 num_files=1     job=2   return_code=23
[2017-10-08 11:12:37.98913] W [master(/bricks/brick0/b0):1267:process]
_GMaster: incomplete sync, retrying changelogs   files=['CHANGELOG.1507461149']
[2017-10-08 11:12:38.209673] I [master(/bricks/brick0/b0):1860:syncjob] Syncer:
Sync Time Taken duration=0.1562 num_files=1     job=1   return_code=23
[2017-10-08 11:12:38.214554] W [master(/bricks/brick0/b0):1267:process]
_GMaster: incomplete sync, retrying changelogs  files=['CHANGELOG.1507461149']
[2017-10-08 11:12:39.236735] I [master(/bricks/brick0/b0):1860:syncjob] Syncer:
Sync Time Taken duration=0.1782 num_files=1     job=3   return_code=23
[2017-10-08 11:12:39.241420] W [master(/bricks/brick0/b0):1267:process]
_GMaster: incomplete sync, retrying changelogs  files=['CHANGELOG.1507461149']
[2017-10-08 11:12:40.267714] I [master(/bricks/brick0/b0):1860:syncjob] Syncer:
Sync Time Taken duration=0.1557 num_files=1     job=2   return_code=23
[2017-10-08 11:12:40.272130] W [master(/bricks/brick0/b0):1267:process]
_GMaster: incomplete sync, retrying changelogs  files=['CHANGELOG.1507461149']
[2017-10-08 11:12:41.384635] I [master(/bricks/brick0/b0):1860:syncjob] Syncer:
Sync Time Taken duration=0.1578 num_files=1     job=1   return_code=23
[2017-10-08 11:12:41.389205] E [master(/bricks/brick0/b0):1246:process]
_GMaster: changelogs could not be processed completely - moving on...  
files=['CHANGELOG.1507461149']
[2017-10-08 11:12:41.408358] I [master(/bricks/brick0/b0):1295:process]
_GMaster: Entry Time Taken      MKD=0   MKN=0   LIN=0   SYM=0   REN=0   RMD=0  
CRE=0   duration=0.0000 UNL=0
[2017-10-08 11:12:41.410573] I [master(/bricks/brick0/b0):1305:process]
_GMaster: Data/Metadata Time Taken      SETA=0  SETX=0  meta_duration=0.0000   
data_duration=1507461161.4104   DATA=0  XATT=0
[2017-10-08 11:12:41.412946] I [master(/bricks/brick0/b0):1315:process]
_GMaster: Batch Completed       changelog_end=1507461149       
entry_stime=(1507461148, 0)     changelog_start=1507461149     
stime=(1507461148, 0)   duration=9.7488 num_changelogs=1       
mode=live_changelog
/var/log/glusterfs/geo-replication/master/ssh\%3A\%2F\%2Froot\%40192.168.122.13\%3Agluster\%3A\%2F\%2F127.0.0.1\%3Aslave.log
(END)

Slave mount logs:

[2017-10-08 11:12:31.680512] E [MSGID: 101046]
[dht-common.c:702:dht_discover_complete] 0-slave-dht: dict is null
[2017-10-08 11:12:31.687599] I [dict.c:471:dict_get]
(-->/usr/local/lib/glusterfs/4.0dev/xlator/cluster/distribute.so(+0x28c31)
[0x7f4af9601c31]
-->/usr/local/lib/glusterfs/4.0dev/xlator/cluster/distribute.so(+0x28448)
[0x7f4af9601448] -->/usr/local/lib/libglusterfs.so.0(dict_get+0x79)
[0x7f4b0765662d] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid
argument]
/var/log/glusterfs/geo-replication-slaves/3ab6f770-d2a9-440f-ba28-83bcf1a40eb6:fedora3.\%2Fbricks\%2Fbrick0\%2Fb0.slave.gluster.log
(END)

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.