[Gluster-users] no progress in geo-replication
Dietmar Putz
dietmar.putz at 3q.video
Fri Mar 12 10:53:54 UTC 2021
Hi Shwetha,
thank you for your reply...
I ran a few tests in debug mode but found no real indication of the
problem. After each start of the geo-replication a few files are
transferred at the beginning, and then no further transfers happen.
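For reference, the overall session state can be checked at any time with
the status command (volume and slave names taken from the gsyncd working
directory paths below):

  gluster volume geo-replication mvol1 gl-slave-01-int::svol1 status detail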
A few minutes after the start, the number of changelog files below the
<brick> working directory looks like this:
[ 06:42:26 ] - root at gl-master-02 ~/tmp $./var_gluster.sh
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processed
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processing
: 27 ### growing the whole time
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history
: 324861
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processed
: 1
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processing
: 324859 ### finished building changelog files
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.current
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.current
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/xsync
: 0
---
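(var_gluster.sh is only a small helper that counts the entries in the
gsyncd working directories; a minimal sketch of what it does, assuming the
directory layout shown above:)

  #!/bin/bash
  # count files in each changelog working directory of one brick of the session
  BASE=/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1
  for d in .processed .processing .history .history/.processed \
           .history/.processing .history/.current .current xsync; do
      printf '%s : %s\n' "$BASE/$d" "$(find "$BASE/$d" -type f | wc -l)"
  done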
As far as I remember, at the very beginning I saw a few changelog files
in <brick>/.processed for a short moment, but always with size 0. Even
after several hours there are no files in <brick>/.processed.
In strace there are a lot of messages like 'failed: No data available' and
'rsync error: some files/attrs were not transferred ... (code 23)' during
roughly the first 5-10 minutes after the geo-rep start.
For example, gfid 8d601e5b-180c....:
19361 1615530800.812727 select(7, [6], [], [], NULL <unfinished ...>
19357 1615530800.812779 select(0, NULL, NULL, NULL, {tv_sec=0,
tv_usec=235797} <unfinished ...>
19352 1615530800.816522
lstat(".gfid/f0ed7d0e-83be-4c3f-b2c8-f763e9aada12",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.817723
lstat(".gfid/c5b44852-9cf9-441b-8766-d87bfac774c8",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.819507
lstat(".gfid/b0b71bcc-7653-4ab8-b863-a83d395f5e91",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.821106
lstat(".gfid/c4e80ff5-2e08-4e68-9a4d-ea7f45ed290d",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.822874
lstat(".gfid/aaa468d8-9d6a-4aaf-8344-c57440286f5c", <unfinished ...>
19418 1615530800.823466 <... restart_syscall resumed> ) = 1
19418 1615530800.823519 read(14, "rsync: get_xattr_data:
lgetxattr(\"\"/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234\"\",\"trusted.glusterfs.mdata\",0)
failed: No data available (61)\n", 32768) = 171
19418 1615530800.823587 poll([{fd=14, events=POLLIN}], 1, -1 <unfinished
...>
19352 1615530800.823830 <... lstat resumed> {st_mode=S_IFREG|0644,
st_size=4226767, ...}) = 0
19352 1615530800.823882
lstat(".gfid/8164ea3b-44f6-4ea2-a75f-501cea0024cc",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.897938
lstat(".gfid/01da73ae-1f88-498d-8fe5-84ea76db12f3",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.934281
lstat(".gfid/be48f891-cdc1-4e4c-a141-7001ae3f592e",
{st_mode=S_IFDIR|0777, st_size=4096, ...}) = 0
19352 1615530800.935938
lstat(".gfid/501fab77-5e83-42cb-9edf-ce30bc3a86a9",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.937481
lstat(".gfid/668f6bd1-bdb0-46e0-9cd4-c7ebc38fbaf9", <unfinished ...>
19417 1615530800.961937 <... restart_syscall resumed> ) = 1
19417 1615530800.962042 read(13, "rsync error: some files/attrs were not
transferred (see previous errors) (code 23) at main.c(1196)
[sender=3.1.2]\n", 32768) = 114
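To get an idea how widespread this is, the strace output can be grepped
(assuming it was written to a file, here called strace.out):

  # rsync lines that failed to read the mdata xattr with ENODATA
  grep 'trusted.glusterfs.mdata' strace.out | grep -c 'No data available'
  # distinct gfids mentioned in those failures
  grep 'No data available' strace.out | grep -o '\.gfid/[0-9a-f-]\{36\}' | sort -u | wc -l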
According to gsyncd.log this gfid is a candidate for syncing, like many
others; only very few are reported as 'synced':
[2021-03-12 06:33:20.651147] D [master(worker
/brick1/mvol1):318:a_syncdata] _GMaster: candidate for syncing
file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
[2021-03-12 06:35:17.419920] D [master(worker
/brick1/mvol1):318:a_syncdata] _GMaster: candidate for syncing
file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
[2021-03-12 06:35:03.382977] D [master(worker /brick1/mvol1):324:regjob]
_GMaster: synced file=.gfid/a3656075-784c-4377-a482-4aad8378acf0
When I try to fetch the attributes I get 'No such attribute' for the
mentioned gfid, while for a synced gfid the attributes are available:
[ 09:58:54 ] - root at gl-master-02 ~/tmp $getfattr -m . -d -e hex
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234:
trusted.glusterfs.mdata: No such attribute
[ 09:59:38 ] - root at gl-master-02 ~/tmp $getfattr -m . -d -e hex
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
getfattr: Removing leading '/' from absolute path names
# file:
tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
trusted.glusterfs.mdata=0x010000000000000000000000005d1f73ff000000000da35668000000005d1f73fd0000000015811b46000000005f29050b000000001d5363a6
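The same check can be run over all gfids that gsyncd flags in the debug
log; a rough sketch (the log path is a placeholder, the aux mount is the
one from above):

  MNT=/tmp/gsyncd-aux-mount-46pc26b9
  LOG=gsyncd.log    # placeholder: the session's gsyncd.log for this brick
  grep -o 'file=\.gfid/[0-9a-f-]*' "$LOG" | sort -u | cut -d/ -f2 |
  while read -r gfid; do
      echo "== $gfid"
      getfattr -n trusted.glusterfs.mdata -e hex "$MNT/.gfid/$gfid"
  done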
A stat on that directory works, for example:
[ 10:07:19 ] - root at gl-master-02 ~/tmp $stat
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
File:
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
Size: 4096 Blocks: 8 IO Block: 131072 directory
Device: 37h/55d Inode: 9394601311212820456 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-01-17 07:26:09.596743288 +0000
Modify: 2021-03-12 07:34:39.382122663 +0000
Change: 2021-03-12 07:34:39.383446790 +0000
Birth: -
Currently I have no clue how to get this fixed. Obviously the
trusted.glusterfs.mdata extended attribute is missing for most entries in
/tmp/gsyncd-aux-mount.../ while it exists for some others.
I believe rsync itself is not the cause; rather it is the missing
attributes?!
Lastly, the gfid points to a directory; when I try to get the attributes
for this directory via the brick path, it succeeds:
[ 10:15:25 ] - root at gl-master-02 ~/tmp $ls -l
/brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234
lrwxrwxrwx 1 root root 56 Nov 5 17:54
/brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234 ->
../../f0/94/f094bf06-2806-4f90-9a79-489827c6cdf9/2217547
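(For directories the .glusterfs entry is a symlink chain up to the brick
root, so the backend path can also be resolved directly; a small sketch:)

  BRICK=/brick1/mvol1
  GFID=8d601e5b-180c-46c8-b64f-ae6224542234
  # directory gfids are symlinks below .glusterfs, so readlink -f walks the
  # chain back to the real backend path (plain files are hardlinks instead,
  # so this does not work for them)
  readlink -f "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"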
[ 10:38:01 ] - root at gl-master-02 ~ $getfattr -m . -d -e hex
/brick1/mvol1/2137/files/20/11/2217547
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11/2217547
trusted.gfid=0x8d601e5b180c46c8b64fae6224542234
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed359000000007a2d37c1a8b9af89
trusted.glusterfs.dht.mds=0x00000000
[ 10:38:42 ] - root at gl-master-02 ~ $getfattr -m . -d -e hex
/brick1/mvol1/2137/files/20/11
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11
trusted.gfid=0xf094bf0628064f909a79489827c6cdf9
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed35900000000d1738834ffffffff
trusted.glusterfs.mdata=0x010000000000000000000000005fc5378000000000077ba08a000000005fc535b60000000038d942cc000000005f9ffc610000000007b08744
[ 10:39:54 ] - root at gl-master-02 ~ $
But the geo-rep ends up in a loop, without any 'E' (error) messages:
[2021-03-12 10:46:40.572500] D [repce(worker
/brick1/mvol1):215:__call__] RepceClient: call
19352:140387951818496:1615546000.5609794 keep_alive -> 256
[2021-03-12 10:46:41.23154] D [master(worker
/brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took 5.017846
seconds
[2021-03-12 10:46:41.35729] D [master(worker
/brick2/mvol1):578:crawlwrap] _GMaster: Crawl info
cluster_stime=(1609281098, 0) brick_stime=(1609281900, 0)
[2021-03-12 10:46:46.41012] D [master(worker
/brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took 5.017512
seconds
[2021-03-12 10:46:46.53818] D [master(worker
/brick2/mvol1):578:crawlwrap] _GMaster: Crawl info
cluster_stime=(1609281098, 0) brick_stime=(1609281900, 0)
[2021-03-12 10:46:48.269174] D [repce(worker /brick2/mvol1):195:push]
RepceClient: call 19354:140476158043904:1615546008.2690222
keep_alive({'version': (1, 0), 'uuid':
'2f5de6e4-66de-40a7-9f24-4762aad3ca96', 'retval': 0, 'volume_mark':
(1609275788, 819193), 'timeout': 1615546128},) ...
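For what it's worth, the stime tuples in the crawl info look like plain
epoch seconds (assuming the usual (seconds, subseconds) format), so the
session apparently last advanced at the end of December:

  date -u -d @1609281098    # cluster_stime -> Tue Dec 29 22:31:38 UTC 2020
  date -u -d @1609281900    # brick_stime   -> Tue Dec 29 22:45:00 UTC 2020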
Does anyone have any idea how to solve this problem?
best regards,
Dietmar
On 04.03.21 08:48, Shwetha Acharya wrote:
> Hi Dietmar,
>
>> batch-fsync-delay-usec was already set to 0 and I increased the
>> sync_jobs from 3 to 6. The moment I increased the sync_jobs, the
>> following error appeared in gsyncd.log:
>>
>> [2021-03-03 23:17:46.59727] E [syncdutils(worker
>> /brick1/mvol1):312:log_raise_exception] <top>: connection to peer
>> is broken
>> [2021-03-03 23:17:46.59912] E [syncdutils(worker
>> /brick2/mvol1):312:log_raise_exception] <top>: connection to peer
>> is broken
>
> If the geo-rep session is currently not in a faulty state, we need not be
> bothered about this log message. It is normal: when the config is
> updated, a geo-rep restart occurs and the above message pops up.
>
>> passive nodes became active and the content in <brick>/.processing
>> was removed. currently new changelog files are created in this
>> directory. shortly before I changed the sync_jobs I have checked
>> the <brick>/.processing directory on the master nodes. the result
>> was the same for every master node.
>>
>> since the last error about 12 hours ago nearly 2400 changelog
>> files were created on each node, but it looks like none of them
>> were consumed.
>
> Processed changelogs that are synced are archived under
> <brick>/.processed directory. Verify if the latest file is created there.
>
>> at the moment I'm not sure what is right and what is
>> wrong... should at least the oldest changelog files in this
>> directory have been processed gradually?
>
> Also you can try to set the log-level to debug for a while and set it
> back to info (to avoid flooding the logs), and check the logs to get a
> better picture of the scenario.
> #gluster volume geo-replication <primary> <ip>::<secondary> config
> log-level DEBUG
> #gluster volume geo-replication <primary> <ip>::<secondary> config
> log-level INFO
>
> Regards,
> Shwetha
>