[Gluster-users] Geo-Replication File not Found on /.glusterfs/XX/XX/XXXXXXXXXXXX
Senén Vidal Blanco
senenvidal at sgisoft.com
Wed Mar 25 09:14:59 UTC 2020
Hi,
I have a problem with the Geo-Replication system.
The first synchronization completed successfully a few days ago, but after a
short period of operation I ran into an error message that prevents the sync
from continuing.
Here is a summary of the configuration:
Debian 10
Glusterfs 7.3
Master volume: archivosvao
Slave volume: archivossamil
volume geo-replication archivosvao samil::archivossamil config
access_mount:false
allow_network:
change_detector:changelog
change_interval:5
changelog_archive_format:%Y%m
changelog_batch_size:727040
changelog_log_file:/var/log/glusterfs/geo-replication/
archivosvao_samil_archivossamil/changes-${local_id}.log
changelog_log_level:INFO
checkpoint:0
cli_log_file:/var/log/glusterfs/geo-replication/cli.log
cli_log_level:INFO
connection_timeout:60
georep_session_working_dir:/var/lib/glusterd/geo-replication/
archivosvao_samil_archivossamil/
gfid_conflict_resolution:true
gluster_cli_options:
gluster_command:gluster
gluster_command_dir:/usr/sbin
gluster_log_file:/var/log/glusterfs/geo-replication/
archivosvao_samil_archivossamil/mnt-${local_id}.log
gluster_log_level:INFO
gluster_logdir:/var/log/glusterfs
gluster_params:aux-gfid-mount acl
gluster_rundir:/var/run/gluster
glusterd_workdir:/var/lib/glusterd
gsyncd_miscdir:/var/lib/misc/gluster/gsyncd
ignore_deletes:false
isolated_slaves:
log_file:/var/log/glusterfs/geo-replication/archivosvao_samil_archivossamil/
gsyncd.log
log_level:INFO
log_rsync_performance:false
master_disperse_count:1
master_distribution_count:1
master_replica_count:1
max_rsync_retries:10
meta_volume_mnt:/var/run/gluster/shared_storage
pid_file:/var/run/gluster/gsyncd-archivosvao-samil-archivossamil.pid
remote_gsyncd:
replica_failover_interval:1
rsync_command:rsync
rsync_opt_existing:true
rsync_opt_ignore_missing_args:true
rsync_options:
rsync_ssh_options:
slave_access_mount:false
slave_gluster_command_dir:/usr/sbin
slave_gluster_log_file:/var/log/glusterfs/geo-replication-slaves/
archivosvao_samil_archivossamil/mnt-${master_node}-${master_brick_id}.log
slave_gluster_log_file_mbr:/var/log/glusterfs/geo-replication-slaves/
archivosvao_samil_archivossamil/mnt-mbr-${master_node}-${master_brick_id}.log
slave_gluster_log_level:INFO
slave_gluster_params:aux-gfid-mount acl
slave_log_file:/var/log/glusterfs/geo-replication-slaves/
archivosvao_samil_archivossamil/gsyncd.log
slave_log_level:INFO
slave_timeout:120
special_sync_mode:
ssh_command:ssh
ssh_options:-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/
lib/glusterd/geo-replication/secret.pem
ssh_options_tar:-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /
var/lib/glusterd/geo-replication/tar_ssh.pem
ssh_port:22
state_file:/var/lib/glusterd/geo-replication/archivosvao_samil_archivossamil/
monitor.status
state_socket_unencoded:
stime_xattr_prefix:trusted.glusterfs.c7fa7778-
f2e4-48f9-8817-5811c09964d5.8d4c7ef7-35fc-497a-9425-66f4aced159b
sync_acls:true
sync_jobs:3
sync_method:rsync
sync_xattrs:true
tar_command:tar
use_meta_volume:false
use_rsync_xattrs:false
working_dir:/var/lib/misc/gluster/gsyncd/archivosvao_samil_archivossamil/
gluster> volume info
Volume Name: archivossamil
Type: Distribute
Volume ID: 8d4c7ef7-35fc-497a-9425-66f4aced159b
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: samil:/brickarchivos/archivos
Options Reconfigured:
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
features.read-only: on
Volume Name: archivosvao
Type: Distribute
Volume ID: c7fa7778-f2e4-48f9-8817-5811c09964d5
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: vao:/brickarchivos/archivos
Options Reconfigured:
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
Volume Name: home
Type: Replicate
Volume ID: 74522542-5d7a-4fdd-9cea-76bf1ff27e7d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: samil:/brickhome/home
Brick2: vao:/brickhome/home
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
These errors appear in the master logs:
.............
[2020-03-25 09:00:12.554226] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=2 return_code=0
duration=0.0483
[2020-03-25 09:00:12.772688] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=2 num_files=3 return_code=0
duration=0.0539
[2020-03-25 09:00:13.112986] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=2 return_code=0
duration=0.0575
[2020-03-25 09:00:13.311976] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=2 num_files=1 return_code=0
duration=0.0379
[2020-03-25 09:00:13.382845] I [master(worker /brickarchivos/archivos):
1227:process_change] _GMaster: Entry ops failed with gfid mismatch
count=1
[2020-03-25 09:00:13.385680] E [syncdutils(worker /brickarchivos/archivos):
339:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line
332, in main func(args)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/subcmds.py",
line 86, in subcmd_worker local.service_loop(remote)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py",
line 1297, in service_loop g3.crawlwrap(oneshot=True)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
602, in crawlwrap self.crawl()
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1592, in crawl self.changelogs_batch_process(changes)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1492, in changelogs_batch_process self.process(batch)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1327, in process self.process_change(change, done, retry)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1230, in process_change self.handle_entry_failures(failures, entries)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
973, in handle_entry_failures failures1, retries, entry_ops1)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
936, in fix_possible_entry_failures pargfid))
FileNotFoundError: [Errno 2] No such file or directory: '/brickarchivos/
archivos/.glusterfs/6e/eb/6eeb2c8f-da55-4066-995b-691290b69fdf'
[2020-03-25 09:00:13.435045] I [repce(agent /brickarchivos/archivos):
96:service_loop] RepceServer: terminating on reaching EOF.
[2020-03-25 09:00:14.248754] I [monitor(monitor):280:monitor] Monitor: worker
died in startup phase brick=/brickarchivos/archivos
[2020-03-25 09:00:16.83872] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change status=Faulty
[2020-03-25 09:00:36.304047] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change status=Initializing...
[2020-03-25 09:00:36.304274] I [monitor(monitor):159:monitor] Monitor:
starting gsyncd worker brick=/brickarchivos/archivos slave_node=samil
[2020-03-25 09:00:36.391111] I [gsyncd(agent /brickarchivos/archivos):
318:main] <top>: Using session config file path=/var/lib/glusterd/geo-
replication/archivosvao_samil_archivossamil/gsyncd.conf
[2020-03-25 09:00:36.392865] I [changelogagent(agent /brickarchivos/archivos):
72:__init__] ChangelogAgent: Agent listining...
[2020-03-25 09:00:36.399606] I [gsyncd(worker /brickarchivos/archivos):
318:main] <top>: Using session config file path=/var/lib/glusterd/geo-
replication/archivosvao_samil_archivossamil/gsyncd.conf
[2020-03-25 09:00:36.412956] I [resource(worker /brickarchivos/archivos):
1386:connect_remote] SSH: Initializing SSH connection between master and
slave...
[2020-03-25 09:00:37.772666] I [resource(worker /brickarchivos/archivos):
1435:connect_remote] SSH: SSH connection between master and slave established.
duration=1.3594
[2020-03-25 09:00:37.773320] I [resource(worker /brickarchivos/archivos):
1105:connect] GLUSTER: Mounting gluster volume locally...
[2020-03-25 09:00:38.821624] I [resource(worker /brickarchivos/archivos):
1128:connect] GLUSTER: Mounted gluster volume duration=1.0479
[2020-03-25 09:00:38.822003] I [subcmds(worker /brickarchivos/archivos):
84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to
monitor
[2020-03-25 09:00:41.797329] I [master(worker /brickarchivos/archivos):
1640:register] _GMaster: Working dir path=/var/lib/misc/gluster/gsyncd/
archivosvao_samil_archivossamil/brickarchivos-archivos
[2020-03-25 09:00:41.798168] I [resource(worker /brickarchivos/archivos):
1291:service_loop] GLUSTER: Register time time=1585126841
[2020-03-25 09:00:42.143373] I [gsyncdstatus(worker /brickarchivos/archivos):
281:set_active] GeorepStatus: Worker Status Change status=Active
[2020-03-25 09:00:42.310175] I [gsyncdstatus(worker /brickarchivos/archivos):
253:set_worker_crawl_status] GeorepStatus: Crawl Status Change
status=History Crawl
[2020-03-25 09:00:42.311381] I [master(worker /brickarchivos/archivos):
1554:crawl] _GMaster: starting history crawl turns=1 stime=(1585015849, 0)
etime=1585126842 entry_stime=(1585043575, 0)
[2020-03-25 09:00:43.347883] I [master(worker /brickarchivos/archivos):
1583:crawl] _GMaster: slave's time stime=(1585015849, 0)
[2020-03-25 09:00:43.932979] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=7 return_code=0
duration=0.1022
[2020-03-25 09:00:43.980473] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=1 return_code=0
duration=0.0467
[2020-03-25 09:00:44.387296] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=2 num_files=5 return_code=0
duration=0.0539
[2020-03-25 09:00:44.424803] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=2 num_files=1 return_code=0
duration=0.0368
[2020-03-25 09:00:44.877503] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=3 num_files=4 return_code=0
duration=0.0431
[2020-03-25 09:00:44.918785] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=3 num_files=3 return_code=0
duration=0.0403
[2020-03-25 09:00:45.20351] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=1 return_code=0
duration=0.0382
[2020-03-25 09:00:45.55611] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=1 return_code=0
duration=0.0344
[2020-03-25 09:00:45.90699] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=1 return_code=0
duration=0.0341
.............
It seems that the source of the error is the absence of this file:
FileNotFoundError: [Errno 2] No such file or directory: '/brickarchivos/
archivos/.glusterfs/6e/eb/6eeb2c8f-da55-4066-995b-691290b69fdf'
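For context, the path in the traceback is the brick-internal GFID index: Gluster keeps a hardlink for every file under .glusterfs/<first two hex digits of the GFID>/<next two>/<full GFID> on the brick. A minimal sketch of that mapping (Python, for illustration only; gfid_backend_path is a hypothetical helper, not part of gsyncd):

```python
def gfid_backend_path(brick_root: str, gfid: str) -> str:
    """Map a GFID to its .glusterfs hardlink path on a brick.

    Gluster shards the index by the first two bytes of the GFID's
    hex form: .glusterfs/<aa>/<bb>/<full-gfid>.
    """
    return f"{brick_root}/.glusterfs/{gfid[0:2]}/{gfid[2:4]}/{gfid}"


print(gfid_backend_path("/brickarchivos/archivos",
                        "6eeb2c8f-da55-4066-995b-691290b69fdf"))
# → /brickarchivos/archivos/.glusterfs/6e/eb/6eeb2c8f-da55-4066-995b-691290b69fdf
```

So the FileNotFoundError means the GFID hardlink for that entry is missing on the master brick, which is why the entry-op retry keeps failing.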
When the error appears, gsyncd retries the sync and enters a retry loop.
I stopped the sync and tried to resume it the next day. Now the error appears
for a different file, but always within the brick's .glusterfs index path:
[2020-03-23 16:49:20.729115] I [master(worker /brickarchivos/archivos):
1227:process_change] _GMaster: Entry ops failed with gfid mismatch
count=1
[2020-03-23 16:49:20.731028] E [syncdutils(worker /brickarchivos/archivos):
339:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line
332, in main func(args)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/subcmds.py",
line 86, in subcmd_worker local.service_loop(remote)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py",
line 1297, in service_loop g3.crawlwrap(oneshot=True)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
602, in crawlwrap self.crawl()
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1592, in crawl self.changelogs_batch_process(changes)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1492, in changelogs_batch_process self.process(batch)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1327, in process self.process_change(change, done, retry)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1230, in process_change self.handle_entry_failures(failures, entries)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
973, in handle_entry_failures failures1, retries, entry_ops1)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
936, in fix_possible_entry_failures
pargfid))
FileNotFoundError: [Errno 2] No such file or directory: '/brickarchivos/
archivos/.glusterfs/63/11/63113be6-0774-4719-96a6-619f7777aed2'
[2020-03-23 16:49:20.764215] I [repce(agent /brickarchivos/archivos):
96:service_loop] RepceServer: terminating on reaching EOF.
I have tried removing the geo-replication session and recreating it, but the
problem recurs. I did not delete the slave data, since it is more than 2.5 TB
and a full resynchronization would take several days:
volume geo-replication archivosvao samil::archivossamil stop
volume geo-replication archivosvao samil::archivossamil delete
volume set archivosvao geo-replication.indexing off
volume geo-replication archivosvao samil::archivossamil create push-pem force
volume geo-replication archivosvao samil::archivossamil start
But the error persists.
Any help would be appreciated.
Thank you.