[Gluster-users] no progress in geo-replication

Dietmar Putz dietmar.putz at 3q.video
Fri Mar 12 10:53:54 UTC 2021


Hi Shwetha,

thank you for your reply...

I ran a few tests in debug mode and found no real indication of the
problem. After each start of the geo-replication, some files are
transferred at the beginning, and then no further transfers happen.

A few minutes after the start, the number of changelog files in <brick>
looks like this:

[ 06:42:26 ] - root at gl-master-02  ~/tmp $./var_gluster.sh
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processed : 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processing : 27                         ### growing the whole time
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history : 324861
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processed : 1
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processing : 324859            ### finished building changelog files
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.current : 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.current : 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/xsync : 0
---
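
(var_gluster.sh is just a small local helper; roughly, it does something
like the following - a sketch only, counting the entries in each gsyncd
working directory of the brick, with the layout shown above:)

#!/bin/bash
# rough sketch of var_gluster.sh - count entries per gsyncd working directory
BASE=/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1
for d in .processed .processing .history .history/.processed \
         .history/.processing .history/.current .current xsync; do
    printf '%s : %s\n' "$BASE/$d" "$(ls -A "$BASE/$d" 2>/dev/null | wc -l)"
done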

As far as I remember, at the beginning I saw a few changelog files in
<brick>/.processed for a short moment, but always with size 0. Even
after some hours there are no files in <brick>/.processed.

The strace output contains a lot of messages like 'failed: No data
available' and 'rsync error: some files/attrs were not transferred ...
(code 23)' for about the first 5-10 minutes after the geo-rep start.
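
(For reference, the trace was captured roughly like this - an assumed
reconstruction, not shown in the logs: -ttt prints the raw epoch
timestamps seen below, -f follows the rsync children of the gsyncd
worker, and 19352 is the worker PID of this session:)

strace -f -ttt -s 256 -p 19352 -o /tmp/gsyncd-worker.strace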

For example, gfid 8d601e5b-180c...:

19361 1615530800.812727 select(7, [6], [], [], NULL <unfinished ...>
19357 1615530800.812779 select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=235797} <unfinished ...>
19352 1615530800.816522 lstat(".gfid/f0ed7d0e-83be-4c3f-b2c8-f763e9aada12", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.817723 lstat(".gfid/c5b44852-9cf9-441b-8766-d87bfac774c8", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.819507 lstat(".gfid/b0b71bcc-7653-4ab8-b863-a83d395f5e91", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.821106 lstat(".gfid/c4e80ff5-2e08-4e68-9a4d-ea7f45ed290d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.822874 lstat(".gfid/aaa468d8-9d6a-4aaf-8344-c57440286f5c", <unfinished ...>
19418 1615530800.823466 <... restart_syscall resumed> ) = 1
19418 1615530800.823519 read(14, "rsync: get_xattr_data: lgetxattr(\"\"/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234\"\",\"trusted.glusterfs.mdata\",0) failed: No data available (61)\n", 32768) = 171
19418 1615530800.823587 poll([{fd=14, events=POLLIN}], 1, -1 <unfinished ...>
19352 1615530800.823830 <... lstat resumed> {st_mode=S_IFREG|0644, st_size=4226767, ...}) = 0
19352 1615530800.823882 lstat(".gfid/8164ea3b-44f6-4ea2-a75f-501cea0024cc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.897938 lstat(".gfid/01da73ae-1f88-498d-8fe5-84ea76db12f3", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.934281 lstat(".gfid/be48f891-cdc1-4e4c-a141-7001ae3f592e", {st_mode=S_IFDIR|0777, st_size=4096, ...}) = 0
19352 1615530800.935938 lstat(".gfid/501fab77-5e83-42cb-9edf-ce30bc3a86a9", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.937481 lstat(".gfid/668f6bd1-bdb0-46e0-9cd4-c7ebc38fbaf9", <unfinished ...>
19417 1615530800.961937 <... restart_syscall resumed> ) = 1
19417 1615530800.962042 read(13, "rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]\n", 32768) = 114

According to gsyncd.log this gfid is a candidate for syncing, like many
others; only very few are 'synced':


[2021-03-12 06:33:20.651147] D [master(worker /brick1/mvol1):318:a_syncdata] _GMaster: candidate for syncing    file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
[2021-03-12 06:35:17.419920] D [master(worker /brick1/mvol1):318:a_syncdata] _GMaster: candidate for syncing    file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234

[2021-03-12 06:35:03.382977] D [master(worker /brick1/mvol1):324:regjob] _GMaster: synced file=.gfid/a3656075-784c-4377-a482-4aad8378acf0

When I try to fetch some attributes, I get 'No such attribute' for the
mentioned gfid, while for a synced gfid the attributes are available:

[ 09:58:54 ] - root at gl-master-02  ~/tmp $getfattr -m . -d -e hex /tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234: trusted.glusterfs.mdata: No such attribute

[ 09:59:38 ] - root at gl-master-02  ~/tmp $getfattr -m . -d -e hex /tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
getfattr: Removing leading '/' from absolute path names
# file: tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
trusted.glusterfs.mdata=0x010000000000000000000000005d1f73ff000000000da35668000000005d1f73fd0000000015811b46000000005f29050b000000001d5363a6
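
To get an idea how widespread the missing mdata is, one could loop over
all 'candidate for syncing' gfids from gsyncd.log and probe each of them
on the aux mount. A rough sketch (the LOG path is an assumption and may
differ per setup):

# probe trusted.glusterfs.mdata for every syncing candidate from the log
MNT=/tmp/gsyncd-aux-mount-46pc26b9
LOG=/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
grep -o 'file=\.gfid/[0-9a-f-]*' "$LOG" | sort -u | cut -d/ -f2 |
while read -r gfid; do
    if getfattr -n trusted.glusterfs.mdata "$MNT/.gfid/$gfid" >/dev/null 2>&1; then
        echo "mdata present: $gfid"
    else
        echo "mdata missing: $gfid"
    fi
done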

I can stat that directory, for example:

[ 10:07:19 ] - root at gl-master-02  ~/tmp $stat /tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
   File: /tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
   Size: 4096          Blocks: 8          IO Block: 131072 directory
Device: 37h/55d    Inode: 9394601311212820456  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: ( 0/    root)
Access: 2021-01-17 07:26:09.596743288 +0000
Modify: 2021-03-12 07:34:39.382122663 +0000
Change: 2021-03-12 07:34:39.383446790 +0000
  Birth: -

Currently I have no clue how to get this fixed. Obviously extended
attributes are missing for most entries in /tmp/gsyncd-aux-mount.../
while they exist for some others.

I believe rsync is not the reason; it's rather the missing
attributes?!

Lastly, the gfid points to a directory; when I try to get attributes for
this directory via the brick path, it succeeds:


[ 10:15:25 ] - root at gl-master-02  ~/tmp $ls -l /brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234
lrwxrwxrwx 1 root root 56 Nov  5 17:54 /brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234 -> ../../f0/94/f094bf06-2806-4f90-9a79-489827c6cdf9/2217547


[ 10:38:01 ] - root at gl-master-02  ~ $getfattr -m . -d -e hex /brick1/mvol1/2137/files/20/11/2217547
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11/2217547
trusted.gfid=0x8d601e5b180c46c8b64fae6224542234
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed359000000007a2d37c1a8b9af89
trusted.glusterfs.dht.mds=0x00000000

[ 10:38:42 ] - root at gl-master-02  ~ $getfattr -m . -d -e hex /brick1/mvol1/2137/files/20/11
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11
trusted.gfid=0xf094bf0628064f909a79489827c6cdf9
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed35900000000d1738834ffffffff
trusted.glusterfs.mdata=0x010000000000000000000000005fc5378000000000077ba08a000000005fc535b60000000038d942cc000000005f9ffc610000000007b08744

[ 10:39:54 ] - root at gl-master-02  ~ $
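
As a side note, the trusted.glusterfs.mdata values above can be decoded
by hand. A sketch, assuming the version-1 on-disk layout (1-byte
version, 8-byte flags, then ctime/mtime/atime as big-endian 64-bit
second/nanosecond pairs):

# decode the mdata value of the synced gfid shown further above
m=010000000000000000000000005d1f73ff000000000da35668000000005d1f73fd0000000015811b46000000005f29050b000000001d5363a6
echo "version: ${m:0:2}  flags: ${m:2:16}"
off=18
for name in ctime mtime atime; do
    sec=$((16#${m:$off:16}))
    nsec=$((16#${m:$((off + 16)):16}))
    printf '%s: %s.%09d UTC\n' "$name" "$(date -u -d "@$sec" '+%F %T')" "$nsec"
    off=$((off + 32))
done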


But the geo-rep ended up in a loop, though without any 'E' (error) entries:

[2021-03-12 10:46:40.572500] D [repce(worker /brick1/mvol1):215:__call__] RepceClient: call 19352:140387951818496:1615546000.5609794 keep_alive -> 256
[2021-03-12 10:46:41.23154] D [master(worker /brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took 5.017846 seconds
[2021-03-12 10:46:41.35729] D [master(worker /brick2/mvol1):578:crawlwrap] _GMaster: Crawl info cluster_stime=(1609281098, 0)    brick_stime=(1609281900, 0)
[2021-03-12 10:46:46.41012] D [master(worker /brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took 5.017512 seconds
[2021-03-12 10:46:46.53818] D [master(worker /brick2/mvol1):578:crawlwrap] _GMaster: Crawl info cluster_stime=(1609281098, 0)    brick_stime=(1609281900, 0)
[2021-03-12 10:46:48.269174] D [repce(worker /brick2/mvol1):195:push] RepceClient: call 19354:140476158043904:1615546008.2690222 keep_alive({'version': (1, 0), 'uuid': '2f5de6e4-66de-40a7-9f24-4762aad3ca96', 'retval': 0, 'volume_mark': (1609275788, 819193), 'timeout': 1615546128},) ...
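
For what it's worth, the stuck stimes from the crawl info above
translate to late December, i.e. the crawl has not moved for over two
months:

date -u -d @1609281098   # cluster_stime -> Tue Dec 29 22:31:38 UTC 2020
date -u -d @1609281900   # brick_stime   -> Tue Dec 29 22:45:00 UTC 2020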


Does anyone have any idea how to solve this problem?

best regards,

Dietmar


On 04.03.21 08:48, Shwetha Acharya wrote:
> Hi Dietmar,
>
> batch-fsync-delay-usec was already set to 0 and I increased the 
> sync_jobs from 3 to 6. In the moment I increased the sync_jobs 
> following error appeared in gsyncd.log :
>
>     [2021-03-03 23:17:46.59727] E [syncdutils(worker /brick1/mvol1):312:log_raise_exception] <top>: connection to peer is broken
>     [2021-03-03 23:17:46.59912] E [syncdutils(worker /brick2/mvol1):312:log_raise_exception] <top>: connection to peer is broken
>
> If the geo-rep session is currently not in a faulty state, we need not
> be bothered about this log message. It is normal: when the config is
> updated, a geo-rep restart occurs and the above message pops up.
>
>     Passive nodes became active and the content in <brick>/.processing
>     was removed. Currently new changelog files are created in this
>     directory. Shortly before I changed the sync_jobs I checked the
>     <brick>/.processing directory on the master nodes. The result was
>     the same for every master node.
>
>     Since the last error about 12 hours ago nearly 2400 changelog
>     files were created on each node, but it looks like none of them
>     were consumed.
>
> Processed changelogs that are synced are archived under the
> <brick>/.processed directory. Verify whether the latest file is
> created there.
>
>     At the moment I'm not sure what is right and what is wrong...
>     should at least the oldest changelog files in this directory have
>     been processed gradually?
>
> Also, you can try to set the log-level to debug for a while and then
> set it back to info (to avoid flooding the logs), and check the logs
> to get a better picture of the scenario.
> #gluster volume geo-replication <primary> <ip>::<secondary> config log-level DEBUG
> #gluster volume geo-replication <primary> <ip>::<secondary> config log-level INFO
>
> Regards,
> Shwetha
>