[Gluster-users] no progress in geo-replication
Dietmar Putz
dietmar.putz at 3q.video
Wed Mar 3 16:28:16 UTC 2021
Hi,
I'm having a problem with geo-replication. A short summary...
About two months ago I added two further nodes to a distributed
replicated volume. For that purpose I stopped the geo-replication,
added two nodes on mvol and svol and started a rebalance process on both
sides. Once the rebalance process was finished I started the
geo-replication again.
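For reference, the expansion was done roughly like this (brick names
taken from the volume info attached below; the exact commands may have
differed slightly):

$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 stop
$gluster volume add-brick mvol1 gl-master-05-int:/brick1/mvol1 gl-master-06-int:/brick1/mvol1
  (same on the slave side with the two new svol bricks)
$gluster volume rebalance mvol1 start
$gluster volume rebalance mvol1 status    (waited until completed, on both sides)
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 start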
After a few days, and apart from some Unicode errors, the status of the
newly added brick changed from hybrid crawl to history crawl. Since then
there has been no progress; no files or directories have been created on
svol for a couple of days.
Looking for a possible reason, I noticed that there was no
/var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1
directory on the newly added slave nodes.
Obviously I had forgotten to add the new svol node IP addresses to
/etc/hosts on all masters. After fixing that I ran the
'... execute gsec_create' and '...create push-pem force' commands again,
and the corresponding directories were created. Geo-replication started
normally, all active sessions were in history crawl (as shown below) and
for a short while some data was transferred to svol. But for about a week
nothing has changed on svol, 0 bytes transferred.
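For clarity, by those commands I mean the standard geo-rep setup
sequence, roughly:

$gluster system:: execute gsec_create
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 create push-pem force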
Meanwhile I have deleted (without reset-sync-time) and recreated the
geo-rep session. The current status is as shown below, but without any
last_synced date.
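Concretely, the delete/recreate was along these lines (deliberately
without reset-sync-time, so the old stime xattrs are kept):

$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 stop
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 delete
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 create push-pem force
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 start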
An entry like "last_synced_entry": 1609283145 is still visible in
/var/lib/glusterd/geo-replication/mvol1_gl-slave-01-int_svol1/*status,
and changelog files are continuously created in
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/<brick>/.processing.
A short time ago I changed log_level to DEBUG for a moment. Unfortunately
I got an 'EOFError: Ran out of input' in gsyncd.log and the rebuild of
.processing started from the beginning.
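(The log level was changed via the geo-rep config interface, roughly like
this; the option name is taken from the config dump attached below, the
CLI may also expect it as log-level:)

$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config log_level DEBUG
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config log_level INFO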
But one of the first very long lines in gsyncd.log looks like:
[2021-03-03 11:59:39.503881] D [repce(worker
/brick1/mvol1):215:__call__] RepceClient: call
9163:139944064358208:1614772779.4982471 history_getchanges ->
['/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processing/CHANGELOG.1609280278',...
1609280278 corresponds to Tuesday, December 29, 2020 10:17:58 PM UTC and
would roughly fit the last_synced date.
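For reference, converted with GNU date (UTC, English locale):

$date -u -d @1609280278
Tue Dec 29 22:17:58 UTC 2020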
However, I have nearly 300k files in <brick>/.history/.processing, and
from the log/trace it seems that every file in <brick>/.history/.processing
will be processed and transferred to <brick>/.processing.
My questions so far...
First of all, is everything still OK with this geo-replication?
Do I have to wait until all changelog files in
<brick>/.history/.processing have been processed before transfers to svol start?
What happens if another error appears in geo-replication while these
changelog files are being processed, i.e. while the crawl status is
history crawl? Does the entire process start from the beginning? Would a
checkpoint be helpful for future decisions (see the sketch below)?
Is there any suitable setting in the Gluster environment that would
influence the speed of this processing (current settings attached)?
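Regarding the checkpoint question, I assume setting one would look
roughly like this (I have not done this yet):

$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config checkpoint now
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 status detail    (should show whether the checkpoint is completed)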
I hope someone can help...
best regards
dietmar
[ 15:17:47 ] - root at gl-master-01
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history
$ls .processing/ | wc -l
294669
[ 12:56:31 ] - root at gl-master-01 ~ $gluster volume geo-replication mvol1 gl-slave-01-int::svol1 status

MASTER NODE         MASTER VOL    MASTER BRICK     SLAVE USER    SLAVE                     SLAVE NODE         STATUS     CRAWL STATUS     LAST_SYNCED
------------------------------------------------------------------------------------------------------------------------------------------------------
gl-master-01-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-05-int    Active     History Crawl    2020-12-29 23:00:48
gl-master-01-int    mvol1         /brick2/mvol1    root          gl-slave-01-int::svol1    gl-slave-03-int    Active     History Crawl    2020-12-29 23:05:45
gl-master-05-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-03-int    Active     History Crawl    2021-02-20 17:38:38
gl-master-06-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-06-int    Passive    N/A              N/A
gl-master-03-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-05-int    Passive    N/A              N/A
gl-master-03-int    mvol1         /brick2/mvol1    root          gl-slave-01-int::svol1    gl-slave-04-int    Active     History Crawl    2020-12-29 23:07:34
gl-master-04-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-06-int    Active     History Crawl    2020-12-29 23:07:22
gl-master-04-int    mvol1         /brick2/mvol1    root          gl-slave-01-int::svol1    gl-slave-01-int    Passive    N/A              N/A
gl-master-02-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-01-int    Passive    N/A              N/A
gl-master-02-int    mvol1         /brick2/mvol1    root          gl-slave-01-int::svol1    gl-slave-06-int    Passive    N/A              N/A
[ 13:14:47 ] - root at gl-master-01 ~ $
-------------- next part --------------
Option Value
------ -----
cluster.lookup-unhashed on
cluster.lookup-optimize on
cluster.min-free-disk 200GB
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.force-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal off
cluster.data-self-heal off
cluster.entry-self-heal off
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
disperse.other-eager-lock on
disperse.eager-lock-timeout 1
disperse.other-eager-lock-timeout 1
cluster.quorum-type none
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.full-lock yes
cluster.optimistic-change-log on
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level ERROR
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 32
performance.cache-priority
performance.cache-size 16GB
performance.io-thread-count 64
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.iot-watchdog-secs (null)
performance.iot-cleanup-disconnected-reqs off
performance.iot-pass-through false
performance.io-cache-pass-through false
performance.cache-size 16GB
performance.qr-cache-timeout 1
performance.cache-invalidation false
performance.ctime-invalidation false
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 4MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes on
performance.aggregate-size 128KB
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open yes
performance.open-behind-pass-through false
performance.read-ahead-page-count 4
performance.read-ahead-pass-through false
performance.readdir-ahead-pass-through false
performance.md-cache-pass-through false
performance.md-cache-timeout 600
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
performance.md-cache-statfs off
performance.xattr-cache-list
performance.nl-cache-pass-through false
network.frame-timeout 1800
network.ping-timeout 20
network.tcp-window-size (null)
client.ssl off
network.remote-dio disable
client.event-threads 4
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 200000
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure on
server.root-squash off
server.all-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 64
server.ssl off
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 4
server.tcp-user-timeout 42
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 1024
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.open-behind on
performance.quick-read on
performance.nl-cache off
performance.stat-prefetch off
performance.client-io-threads off
performance.nfs.write-behind on
performance.nfs.read-ahead on
performance.nfs.io-cache off
performance.nfs.quick-read on
performance.nfs.stat-prefetch off
performance.nfs.io-threads on
performance.force-readdirp true
performance.cache-invalidation false
performance.global-cache-invalidation true
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
features.tag-namespaces off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing on
geo-replication.indexing on
geo-replication.ignore-pid-check on
geo-replication.ignore-pid-check on
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.disable on
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.reserve-size 0
storage.health-check-timeout 10
storage.fips-mode-rchecksum on
storage.force-create-mode 0000
storage.force-directory-mode 0000
storage.create-mask 0777
storage.create-directory-mask 0777
storage.max-hardlinks 100
features.ctime on
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 51
changelog.changelog on
changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5GB
features.trash-internal-op off
cluster.enable-shared-storage disable
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.timeout 45
features.failover-hosts (null)
features.shard off
features.shard-block-size 64MB
features.shard-lru-limit 16384
features.shard-deletion-rate 100
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation on
features.cache-invalidation-timeout 600
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
features.locks-notify-contention no
features.locks-notify-contention-delay 5
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir on
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 600
cluster.brick-multiplex disable
glusterd.vol_count_per_thread 100
cluster.max-bricks-per-process 250
disperse.optimistic-change-log on
disperse.stripe-cache 4
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
features.selinux on
cluster.daemon-log-level INFO
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable
disperse.parallel-writes on
features.sdfs off
features.cloudsync off
features.ctime on
ctime.noatime on
features.cloudsync-storetype (null)
features.enforce-mandatory-lock off
config.global-threading off
config.client-threads 16
config.brick-threads 16
features.cloudsync-remote-read off
features.cloudsync-store-id (null)
features.cloudsync-product-id (null)
-------------- next part --------------
Volume Name: mvol1
Type: Distributed-Replicate
Volume ID: 2f5de6e4-66de-40a7-9f24-4762aad3ca96
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: gl-master-01-int:/brick1/mvol1
Brick2: gl-master-02-int:/brick1/mvol1
Brick3: gl-master-03-int:/brick1/mvol1
Brick4: gl-master-04-int:/brick1/mvol1
Brick5: gl-master-01-int:/brick2/mvol1
Brick6: gl-master-02-int:/brick2/mvol1
Brick7: gl-master-03-int:/brick2/mvol1
Brick8: gl-master-04-int:/brick2/mvol1
Brick9: gl-master-05-int:/brick1/mvol1
Brick10: gl-master-06-int:/brick1/mvol1
Options Reconfigured:
performance.parallel-readdir: on
performance.readdir-ahead: on
storage.fips-mode-rchecksum: on
performance.stat-prefetch: off
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
performance.nl-cache: off
performance.nl-cache-timeout: 600
client.event-threads: 4
server.event-threads: 4
performance.write-behind-window-size: 4MB
performance.nfs.io-threads: on
performance.nfs.quick-read: on
performance.nfs.read-ahead: on
transport.address-family: inet
features.trash-max-filesize: 5GB
features.trash: off
performance.cache-size: 16GB
performance.io-thread-count: 64
network.ping-timeout: 20
cluster.min-free-disk: 200GB
performance.cache-refresh-timeout: 32
changelog.changelog: on
diagnostics.client-log-level: ERROR
nfs.disable: on
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
-------------- next part --------------
access_mount:true
allow_network:
change_detector:changelog
change_interval:5
changelog_archive_format:%Y%m
changelog_batch_size:727040
changelog_log_file:/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/changes-${local_id}.log
changelog_log_level:INFO
checkpoint:0
cli_log_file:/var/log/glusterfs/geo-replication/cli.log
cli_log_level:INFO
connection_timeout:60
georep_session_working_dir:/var/lib/glusterd/geo-replication/mvol1_gl-slave-01-int_svol1/
gfid_conflict_resolution:true
gluster_cli_options:
gluster_command:gluster
gluster_command_dir:/usr/sbin
gluster_log_file:/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/mnt-${local_id}.log
gluster_log_level:INFO
gluster_logdir:/var/log/glusterfs
gluster_params:aux-gfid-mount acl
gluster_rundir:/var/run/gluster
glusterd_workdir:/var/lib/glusterd
gsyncd_miscdir:/var/lib/misc/gluster/gsyncd
ignore_deletes:false
isolated_slaves:
log_file:/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
log_level:INFO
log_rsync_performance:false
master_disperse_count:1
master_distribution_count:2
master_replica_count:1
max_rsync_retries:10
meta_volume_mnt:/var/run/gluster/shared_storage
pid_file:/var/run/gluster/gsyncd-mvol1-gl-slave-01-int-svol1.pid
remote_gsyncd:
replica_failover_interval:1
rsync_command:rsync
rsync_opt_existing:true
rsync_opt_ignore_missing_args:true
rsync_options:
rsync_ssh_options:
slave_access_mount:false
slave_gluster_command_dir:/usr/sbin
slave_gluster_log_file:/var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1/mnt-${master_node}-${master_brick_id}.log
slave_gluster_log_file_mbr:/var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1/mnt-mbr-${master_node}-${master_brick_id}.log
slave_gluster_log_level:INFO
slave_gluster_params:aux-gfid-mount acl
slave_log_file:/var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1/gsyncd.log
slave_log_level:INFO
slave_timeout:120
special_sync_mode:
ssh_command:ssh
ssh_options:-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
ssh_options_tar:-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
ssh_port:22
state_file:/var/lib/glusterd/geo-replication/mvol1_gl-slave-01-int_svol1/monitor.status
state_socket_unencoded:
stime_xattr_prefix:trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.256628ab-57c2-44a4-9367-59e1939ade64
sync_acls:true
sync_jobs:3
sync_method:rsync
sync_xattrs:true
tar_command:tar
use_meta_volume:true
use_rsync_xattrs:false
working_dir:/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/