[Gluster-users] no progress in geo-replication
Dietmar Putz
dietmar.putz at 3q.video
Wed Mar 3 16:28:16 UTC 2021
Hi,
I'm having a problem with geo-replication. A short summary...
About two months ago I added two further nodes to a distributed
replicated volume. For that purpose I stopped the geo-replication,
added two nodes on mvol and svol and started a rebalance process on both
sides. Once the rebalance process was finished I started the
geo-replication again.
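For reference, the expansion was done roughly like this (brick names
taken from the volume info attached below; the exact commands may have
differed slightly):

$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 stop
$gluster volume add-brick mvol1 gl-master-05-int:/brick1/mvol1 gl-master-06-int:/brick1/mvol1
  (same on the slave side with the two new svol bricks)
$gluster volume rebalance mvol1 start
$gluster volume rebalance mvol1 status    (waited until completed, on both sides)
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 start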
After a few days, and apart from some Unicode errors, the status of the
newly added brick changed from hybrid crawl to history crawl. Since then
there has been no progress; no files or directories have been created on
svol for a couple of days.
Looking for a possible reason, I noticed that there was no
/var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1
directory on the newly added slave nodes.
Obviously I had forgotten to add the new svol node IP addresses to
/etc/hosts on all masters. After fixing that I ran the
'... execute gsec_create' and '...create push-pem force' commands again,
and the corresponding directories were created. Geo-replication started
normally, all active sessions were in history crawl (as shown below) and
for a short while some data was transferred to svol. But for about a week
nothing has changed on svol, 0 bytes transferred.
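For clarity, by those commands I mean the standard geo-rep setup
sequence, roughly:

$gluster system:: execute gsec_create
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 create push-pem force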
Meanwhile I have deleted (without reset-sync-time) and recreated the
geo-rep session. The current status is as shown below, but without any
last_synced date.
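Concretely, the delete/recreate was along these lines (deliberately
without reset-sync-time, so the old stime xattrs are kept):

$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 stop
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 delete
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 create push-pem force
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 start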
An entry like "last_synced_entry": 1609283145 is still visible in
/var/lib/glusterd/geo-replication/mvol1_gl-slave-01-int_svol1/*status,
and changelog files are continuously created in
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/<brick>/.processing.
A short time ago I changed log_level to DEBUG for a moment. Unfortunately
I got an 'EOFError: Ran out of input' in gsyncd.log and the rebuild of
.processing started from the beginning.
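(The log level was changed via the geo-rep config interface, roughly like
this; the option name is taken from the config dump attached below, the
CLI may also expect it as log-level:)

$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config log_level DEBUG
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config log_level INFO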
But one of the first very long lines in gsyncd.log looks like:
[2021-03-03 11:59:39.503881] D [repce(worker
/brick1/mvol1):215:__call__] RepceClient: call
9163:139944064358208:1614772779.4982471 history_getchanges ->
['/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processing/CHANGELOG.1609280278',...
1609280278 corresponds to Tuesday, December 29, 2020 10:17:58 PM UTC and
would roughly fit the last_synced date.
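For reference, converted with GNU date (UTC, English locale):

$date -u -d @1609280278
Tue Dec 29 22:17:58 UTC 2020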
However, I have nearly 300k files in <brick>/.history/.processing, and
from the log/trace it seems that every file in <brick>/.history/.processing
will be processed and transferred to <brick>/.processing.
My questions so far...
First of all, is everything still OK with this geo-replication?
Do I have to wait until all changelog files in
<brick>/.history/.processing have been processed before transfers to svol start?
What happens if another error appears in geo-replication while these
changelog files are being processed, i.e. while the crawl status is
history crawl? Does the entire process start from the beginning? Would a
checkpoint be helpful for future decisions (see the sketch below)?
Is there any suitable setting in the Gluster environment that would
influence the speed of this processing (current settings attached)?
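Regarding the checkpoint question, I assume setting one would look
roughly like this (I have not done this yet):

$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config checkpoint now
$gluster volume geo-replication mvol1 gl-slave-01-int::svol1 status detail    (should show whether the checkpoint is completed)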
I hope someone can help...
best regards
dietmar
[ 15:17:47 ] - root at gl-master-01
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history
$ls .processing/ | wc -l
294669
[ 12:56:31 ] - root at gl-master-01 ~ $gluster volume geo-replication mvol1 gl-slave-01-int::svol1 status

MASTER NODE         MASTER VOL    MASTER BRICK     SLAVE USER    SLAVE                     SLAVE NODE         STATUS     CRAWL STATUS     LAST_SYNCED
------------------------------------------------------------------------------------------------------------------------------------------------------
gl-master-01-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-05-int    Active     History Crawl    2020-12-29 23:00:48
gl-master-01-int    mvol1         /brick2/mvol1    root          gl-slave-01-int::svol1    gl-slave-03-int    Active     History Crawl    2020-12-29 23:05:45
gl-master-05-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-03-int    Active     History Crawl    2021-02-20 17:38:38
gl-master-06-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-06-int    Passive    N/A              N/A
gl-master-03-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-05-int    Passive    N/A              N/A
gl-master-03-int    mvol1         /brick2/mvol1    root          gl-slave-01-int::svol1    gl-slave-04-int    Active     History Crawl    2020-12-29 23:07:34
gl-master-04-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-06-int    Active     History Crawl    2020-12-29 23:07:22
gl-master-04-int    mvol1         /brick2/mvol1    root          gl-slave-01-int::svol1    gl-slave-01-int    Passive    N/A              N/A
gl-master-02-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-01-int    Passive    N/A              N/A
gl-master-02-int    mvol1         /brick2/mvol1    root          gl-slave-01-int::svol1    gl-slave-06-int    Passive    N/A              N/A
[ 13:14:47 ] - root at gl-master-01 ~ $
-------------- next part --------------
Option Value
------ -----
cluster.lookup-unhashed on
cluster.lookup-optimize on
cluster.min-free-disk 200GB
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.force-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal off
cluster.data-self-heal off
cluster.entry-self-heal off
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
disperse.other-eager-lock on
disperse.eager-lock-timeout 1
disperse.other-eager-lock-timeout 1
cluster.quorum-type none
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.full-lock yes
cluster.optimistic-change-log on
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level ERROR
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 32
performance.cache-priority
performance.cache-size 16GB
performance.io-thread-count 64
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.iot-watchdog-secs (null)
performance.iot-cleanup-disconnected-reqs off
performance.iot-pass-through false
performance.io-cache-pass-through false
performance.cache-size 16GB
performance.qr-cache-timeout 1
performance.cache-invalidation false
performance.ctime-invalidation false
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 4MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes on
performance.aggregate-size 128KB
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open yes
performance.open-behind-pass-through false
performance.read-ahead-page-count 4
performance.read-ahead-pass-through false
performance.readdir-ahead-pass-through false
performance.md-cache-pass-through false
performance.md-cache-timeout 600
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
performance.md-cache-statfs off
performance.xattr-cache-list
performance.nl-cache-pass-through false
network.frame-timeout 1800
network.ping-timeout 20
network.tcp-window-size (null)
client.ssl off
network.remote-dio disable
client.event-threads 4
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 200000
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure on
server.root-squash off
server.all-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 64
server.ssl off
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 4
server.tcp-user-timeout 42
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 1024
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.open-behind on
performance.quick-read on
performance.nl-cache off
performance.stat-prefetch off
performance.client-io-threads off
performance.nfs.write-behind on
performance.nfs.read-ahead on
performance.nfs.io-cache off
performance.nfs.quick-read on
performance.nfs.stat-prefetch off
performance.nfs.io-threads on
performance.force-readdirp true
performance.cache-invalidation false
performance.global-cache-invalidation true
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
features.tag-namespaces off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing on
geo-replication.indexing on
geo-replication.ignore-pid-check on
geo-replication.ignore-pid-check on
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.disable on
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.reserve-size 0
storage.health-check-timeout 10
storage.fips-mode-rchecksum on
storage.force-create-mode 0000
storage.force-directory-mode 0000
storage.create-mask 0777
storage.create-directory-mask 0777
storage.max-hardlinks 100
features.ctime on
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 51
changelog.changelog on
changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5GB
features.trash-internal-op off
cluster.enable-shared-storage disable
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.timeout 45
features.failover-hosts (null)
features.shard off
features.shard-block-size 64MB
features.shard-lru-limit 16384
features.shard-deletion-rate 100
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation on
features.cache-invalidation-timeout 600
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
features.locks-notify-contention no
features.locks-notify-contention-delay 5
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir on
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 600
cluster.brick-multiplex disable
glusterd.vol_count_per_thread 100
cluster.max-bricks-per-process 250
disperse.optimistic-change-log on
disperse.stripe-cache 4
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
features.selinux on
cluster.daemon-log-level INFO
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable
disperse.parallel-writes on
features.sdfs off
features.cloudsync off
features.ctime on
ctime.noatime on
features.cloudsync-storetype (null)
features.enforce-mandatory-lock off
config.global-threading off
config.client-threads 16
config.brick-threads 16
features.cloudsync-remote-read off
features.cloudsync-store-id (null)
features.cloudsync-product-id (null)
-------------- next part --------------
Volume Name: mvol1
Type: Distributed-Replicate
Volume ID: 2f5de6e4-66de-40a7-9f24-4762aad3ca96
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: gl-master-01-int:/brick1/mvol1
Brick2: gl-master-02-int:/brick1/mvol1
Brick3: gl-master-03-int:/brick1/mvol1
Brick4: gl-master-04-int:/brick1/mvol1
Brick5: gl-master-01-int:/brick2/mvol1
Brick6: gl-master-02-int:/brick2/mvol1
Brick7: gl-master-03-int:/brick2/mvol1
Brick8: gl-master-04-int:/brick2/mvol1
Brick9: gl-master-05-int:/brick1/mvol1
Brick10: gl-master-06-int:/brick1/mvol1
Options Reconfigured:
performance.parallel-readdir: on
performance.readdir-ahead: on
storage.fips-mode-rchecksum: on
performance.stat-prefetch: off
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
performance.nl-cache: off
performance.nl-cache-timeout: 600
client.event-threads: 4
server.event-threads: 4
performance.write-behind-window-size: 4MB
performance.nfs.io-threads: on
performance.nfs.quick-read: on
performance.nfs.read-ahead: on
transport.address-family: inet
features.trash-max-filesize: 5GB
features.trash: off
performance.cache-size: 16GB
performance.io-thread-count: 64
network.ping-timeout: 20
cluster.min-free-disk: 200GB
performance.cache-refresh-timeout: 32
changelog.changelog: on
diagnostics.client-log-level: ERROR
nfs.disable: on
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
-------------- next part --------------
access_mount:true
allow_network:
change_detector:changelog
change_interval:5
changelog_archive_format:%Y%m
changelog_batch_size:727040
changelog_log_file:/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/changes-${local_id}.log
changelog_log_level:INFO
checkpoint:0
cli_log_file:/var/log/glusterfs/geo-replication/cli.log
cli_log_level:INFO
connection_timeout:60
georep_session_working_dir:/var/lib/glusterd/geo-replication/mvol1_gl-slave-01-int_svol1/
gfid_conflict_resolution:true
gluster_cli_options:
gluster_command:gluster
gluster_command_dir:/usr/sbin
gluster_log_file:/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/mnt-${local_id}.log
gluster_log_level:INFO
gluster_logdir:/var/log/glusterfs
gluster_params:aux-gfid-mount acl
gluster_rundir:/var/run/gluster
glusterd_workdir:/var/lib/glusterd
gsyncd_miscdir:/var/lib/misc/gluster/gsyncd
ignore_deletes:false
isolated_slaves:
log_file:/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
log_level:INFO
log_rsync_performance:false
master_disperse_count:1
master_distribution_count:2
master_replica_count:1
max_rsync_retries:10
meta_volume_mnt:/var/run/gluster/shared_storage
pid_file:/var/run/gluster/gsyncd-mvol1-gl-slave-01-int-svol1.pid
remote_gsyncd:
replica_failover_interval:1
rsync_command:rsync
rsync_opt_existing:true
rsync_opt_ignore_missing_args:true
rsync_options:
rsync_ssh_options:
slave_access_mount:false
slave_gluster_command_dir:/usr/sbin
slave_gluster_log_file:/var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1/mnt-${master_node}-${master_brick_id}.log
slave_gluster_log_file_mbr:/var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1/mnt-mbr-${master_node}-${master_brick_id}.log
slave_gluster_log_level:INFO
slave_gluster_params:aux-gfid-mount acl
slave_log_file:/var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1/gsyncd.log
slave_log_level:INFO
slave_timeout:120
special_sync_mode:
ssh_command:ssh
ssh_options:-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
ssh_options_tar:-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
ssh_port:22
state_file:/var/lib/glusterd/geo-replication/mvol1_gl-slave-01-int_svol1/monitor.status
state_socket_unencoded:
stime_xattr_prefix:trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.256628ab-57c2-44a4-9367-59e1939ade64
sync_acls:true
sync_jobs:3
sync_method:rsync
sync_xattrs:true
tar_command:tar
use_meta_volume:true
use_rsync_xattrs:false
working_dir:/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/