[Gluster-users] Geo-Replication File not Found on /.glusterfs/XX/XX/XXXXXXXXXXXX
Senén Vidal Blanco
senenvidal at sgisoft.com
Wed Mar 25 09:14:59 UTC 2020
Hi,
I have a problem with the Geo-Replication system.
The first synchronization completed successfully a few days ago, but after a
short period of operation I ran into an error message that prevents the sync
from continuing.
Here is a summary of the configuration:
Debian 10
Glusterfs 7.3
Master volume: archivosvao
Slave volume: archivossamil
volume geo-replication archivosvao samil::archivossamil config
access_mount:false
allow_network:
change_detector:changelog
change_interval:5
changelog_archive_format:%Y%m
changelog_batch_size:727040
changelog_log_file:/var/log/glusterfs/geo-replication/
archivosvao_samil_archivossamil/changes-${local_id}.log
changelog_log_level:INFO
checkpoint:0
cli_log_file:/var/log/glusterfs/geo-replication/cli.log
cli_log_level:INFO
connection_timeout:60
georep_session_working_dir:/var/lib/glusterd/geo-replication/
archivosvao_samil_archivossamil/
gfid_conflict_resolution:true
gluster_cli_options:
gluster_command:gluster
gluster_command_dir:/usr/sbin
gluster_log_file:/var/log/glusterfs/geo-replication/
archivosvao_samil_archivossamil/mnt-${local_id}.log
gluster_log_level:INFO
gluster_logdir:/var/log/glusterfs
gluster_params:aux-gfid-mount acl
gluster_rundir:/var/run/gluster
glusterd_workdir:/var/lib/glusterd
gsyncd_miscdir:/var/lib/misc/gluster/gsyncd
ignore_deletes:false
isolated_slaves:
log_file:/var/log/glusterfs/geo-replication/archivosvao_samil_archivossamil/
gsyncd.log
log_level:INFO
log_rsync_performance:false
master_disperse_count:1
master_distribution_count:1
master_replica_count:1
max_rsync_retries:10
meta_volume_mnt:/var/run/gluster/shared_storage
pid_file:/var/run/gluster/gsyncd-archivosvao-samil-archivossamil.pid
remote_gsyncd:
replica_failover_interval:1
rsync_command:rsync
rsync_opt_existing:true
rsync_opt_ignore_missing_args:true
rsync_options:
rsync_ssh_options:
slave_access_mount:false
slave_gluster_command_dir:/usr/sbin
slave_gluster_log_file:/var/log/glusterfs/geo-replication-slaves/
archivosvao_samil_archivossamil/mnt-${master_node}-${master_brick_id}.log
slave_gluster_log_file_mbr:/var/log/glusterfs/geo-replication-slaves/
archivosvao_samil_archivossamil/mnt-mbr-${master_node}-${master_brick_id}.log
slave_gluster_log_level:INFO
slave_gluster_params:aux-gfid-mount acl
slave_log_file:/var/log/glusterfs/geo-replication-slaves/
archivosvao_samil_archivossamil/gsyncd.log
slave_log_level:INFO
slave_timeout:120
special_sync_mode:
ssh_command:ssh
ssh_options:-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/
lib/glusterd/geo-replication/secret.pem
ssh_options_tar:-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /
var/lib/glusterd/geo-replication/tar_ssh.pem
ssh_port:22
state_file:/var/lib/glusterd/geo-replication/archivosvao_samil_archivossamil/
monitor.status
state_socket_unencoded:
stime_xattr_prefix:trusted.glusterfs.c7fa7778-
f2e4-48f9-8817-5811c09964d5.8d4c7ef7-35fc-497a-9425-66f4aced159b
sync_acls:true
sync_jobs:3
sync_method:rsync
sync_xattrs:true
tar_command:tar
use_meta_volume:false
use_rsync_xattrs:false
working_dir:/var/lib/misc/gluster/gsyncd/archivosvao_samil_archivossamil/
gluster> volume info
Volume Name: archivossamil
Type: Distribute
Volume ID: 8d4c7ef7-35fc-497a-9425-66f4aced159b
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: samil:/brickarchivos/archivos
Options Reconfigured:
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
features.read-only: on
Volume Name: archivosvao
Type: Distribute
Volume ID: c7fa7778-f2e4-48f9-8817-5811c09964d5
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: vao:/brickarchivos/archivos
Options Reconfigured:
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
Volume Name: home
Type: Replicate
Volume ID: 74522542-5d7a-4fdd-9cea-76bf1ff27e7d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: samil:/brickhome/home
Brick2: vao:/brickhome/home
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
These errors appear in the master logs:
.............
[2020-03-25 09:00:12.554226] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=2 return_code=0
duration=0.0483
[2020-03-25 09:00:12.772688] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=2 num_files=3 return_code=0
duration=0.0539
[2020-03-25 09:00:13.112986] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=2 return_code=0
duration=0.0575
[2020-03-25 09:00:13.311976] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=2 num_files=1 return_code=0
duration=0.0379
[2020-03-25 09:00:13.382845] I [master(worker /brickarchivos/archivos):
1227:process_change] _GMaster: Entry ops failed with gfid mismatch
count=1
[2020-03-25 09:00:13.385680] E [syncdutils(worker /brickarchivos/archivos):
339:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line
332, in main func(args)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/subcmds.py",
line 86, in subcmd_worker local.service_loop(remote)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py",
line 1297, in service_loop g3.crawlwrap(oneshot=True)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
602, in crawlwrap self.crawl()
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1592, in crawl self.changelogs_batch_process(changes)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1492, in changelogs_batch_process self.process(batch)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1327, in process self.process_change(change, done, retry)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1230, in process_change self.handle_entry_failures(failures, entries)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
973, in handle_entry_failures failures1, retries, entry_ops1)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
936, in fix_possible_entry_failures pargfid))
FileNotFoundError: [Errno 2] No such file or directory: '/brickarchivos/
archivos/.glusterfs/6e/eb/6eeb2c8f-da55-4066-995b-691290b69fdf'
[2020-03-25 09:00:13.435045] I [repce(agent /brickarchivos/archivos):
96:service_loop] RepceServer: terminating on reaching EOF.
[2020-03-25 09:00:14.248754] I [monitor(monitor):280:monitor] Monitor: worker
died in startup phase brick=/brickarchivos/archivos
[2020-03-25 09:00:16.83872] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change status=Faulty
[2020-03-25 09:00:36.304047] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change status=Initializing...
[2020-03-25 09:00:36.304274] I [monitor(monitor):159:monitor] Monitor:
starting gsyncd worker brick=/brickarchivos/archivos slave_node=samil
[2020-03-25 09:00:36.391111] I [gsyncd(agent /brickarchivos/archivos):
318:main] <top>: Using session config file path=/var/lib/glusterd/geo-
replication/archivosvao_samil_archivossamil/gsyncd.conf
[2020-03-25 09:00:36.392865] I [changelogagent(agent /brickarchivos/archivos):
72:__init__] ChangelogAgent: Agent listining...
[2020-03-25 09:00:36.399606] I [gsyncd(worker /brickarchivos/archivos):
318:main] <top>: Using session config file path=/var/lib/glusterd/geo-
replication/archivosvao_samil_archivossamil/gsyncd.conf
[2020-03-25 09:00:36.412956] I [resource(worker /brickarchivos/archivos):
1386:connect_remote] SSH: Initializing SSH connection between master and
slave...
[2020-03-25 09:00:37.772666] I [resource(worker /brickarchivos/archivos):
1435:connect_remote] SSH: SSH connection between master and slave established.
duration=1.3594
[2020-03-25 09:00:37.773320] I [resource(worker /brickarchivos/archivos):
1105:connect] GLUSTER: Mounting gluster volume locally...
[2020-03-25 09:00:38.821624] I [resource(worker /brickarchivos/archivos):
1128:connect] GLUSTER: Mounted gluster volume duration=1.0479
[2020-03-25 09:00:38.822003] I [subcmds(worker /brickarchivos/archivos):
84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to
monitor
[2020-03-25 09:00:41.797329] I [master(worker /brickarchivos/archivos):
1640:register] _GMaster: Working dir path=/var/lib/misc/gluster/gsyncd/
archivosvao_samil_archivossamil/brickarchivos-archivos
[2020-03-25 09:00:41.798168] I [resource(worker /brickarchivos/archivos):
1291:service_loop] GLUSTER: Register time time=1585126841
[2020-03-25 09:00:42.143373] I [gsyncdstatus(worker /brickarchivos/archivos):
281:set_active] GeorepStatus: Worker Status Change status=Active
[2020-03-25 09:00:42.310175] I [gsyncdstatus(worker /brickarchivos/archivos):
253:set_worker_crawl_status] GeorepStatus: Crawl Status Change
status=History Crawl
[2020-03-25 09:00:42.311381] I [master(worker /brickarchivos/archivos):
1554:crawl] _GMaster: starting history crawl turns=1 stime=(1585015849, 0)
etime=1585126842 entry_stime=(1585043575, 0)
[2020-03-25 09:00:43.347883] I [master(worker /brickarchivos/archivos):
1583:crawl] _GMaster: slave's time stime=(1585015849, 0)
[2020-03-25 09:00:43.932979] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=7 return_code=0
duration=0.1022
[2020-03-25 09:00:43.980473] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=1 return_code=0
duration=0.0467
[2020-03-25 09:00:44.387296] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=2 num_files=5 return_code=0
duration=0.0539
[2020-03-25 09:00:44.424803] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=2 num_files=1 return_code=0
duration=0.0368
[2020-03-25 09:00:44.877503] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=3 num_files=4 return_code=0
duration=0.0431
[2020-03-25 09:00:44.918785] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=3 num_files=3 return_code=0
duration=0.0403
[2020-03-25 09:00:45.20351] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=1 return_code=0
duration=0.0382
[2020-03-25 09:00:45.55611] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=1 return_code=0
duration=0.0344
[2020-03-25 09:00:45.90699] I [master(worker /brickarchivos/archivos):
1991:syncjob] Syncer: Sync Time Taken job=1 num_files=1 return_code=0
duration=0.0341
.............
It seems that the source of the error is the absence of this file:
FileNotFoundError: [Errno 2] No such file or directory: '/brickarchivos/
archivos/.glusterfs/6e/eb/6eeb2c8f-da55-4066-995b-691290b69fdf'
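For context, the path in the traceback is the brick-internal GFID index: Gluster keeps a hardlink for every file under .glusterfs/<first two hex digits of the GFID>/<next two>/<full GFID> on the brick. A minimal sketch of that mapping (Python, for illustration only; gfid_backend_path is a hypothetical helper, not part of gsyncd):

```python
def gfid_backend_path(brick_root: str, gfid: str) -> str:
    """Map a GFID to its .glusterfs hardlink path on a brick.

    Gluster shards the index by the first two bytes of the GFID's
    hex form: .glusterfs/<aa>/<bb>/<full-gfid>.
    """
    return f"{brick_root}/.glusterfs/{gfid[0:2]}/{gfid[2:4]}/{gfid}"


print(gfid_backend_path("/brickarchivos/archivos",
                        "6eeb2c8f-da55-4066-995b-691290b69fdf"))
# → /brickarchivos/archivos/.glusterfs/6e/eb/6eeb2c8f-da55-4066-995b-691290b69fdf
```

So the FileNotFoundError means the GFID hardlink for that entry is missing on the master brick, which is why the entry-op retry keeps failing.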
When the error appears, gsyncd retries the sync and enters a retry loop.
I stopped the sync and tried to resume it the next day. Now the error appears
for a different file, but always within the brick's .glusterfs index path:
[2020-03-23 16:49:20.729115] I [master(worker /brickarchivos/archivos):
1227:process_change] _GMaster: Entry ops failed with gfid mismatch
count=1
[2020-03-23 16:49:20.731028] E [syncdutils(worker /brickarchivos/archivos):
339:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line
332, in main func(args)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/subcmds.py",
line 86, in subcmd_worker local.service_loop(remote)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py",
line 1297, in service_loop g3.crawlwrap(oneshot=True)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
602, in crawlwrap self.crawl()
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1592, in crawl self.changelogs_batch_process(changes)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1492, in changelogs_batch_process self.process(batch)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1327, in process self.process_change(change, done, retry)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
1230, in process_change self.handle_entry_failures(failures, entries)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
973, in handle_entry_failures failures1, retries, entry_ops1)
File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
936, in fix_possible_entry_failures
pargfid))
FileNotFoundError: [Errno 2] No such file or directory: '/brickarchivos/
archivos/.glusterfs/63/11/63113be6-0774-4719-96a6-619f7777aed2'
[2020-03-23 16:49:20.764215] I [repce(agent /brickarchivos/archivos):
96:service_loop] RepceServer: terminating on reaching EOF.
I have tried removing the geo-replication session and recreating it, but the
problem recurs. I did not delete the slave data, since it is more than 2.5 TB
and a full resynchronization would take several days:
volume geo-replication archivosvao samil::archivossamil stop
volume geo-replication archivosvao samil::archivossamil delete
volume set archivosvao geo-replication.indexing off
volume geo-replication archivosvao samil::archivossamil create push-pem force
volume geo-replication archivosvao samil::archivossamil start
But the error persists.
Any help would be appreciated.
Thank you.