[Gluster-users] Geo-Replication File not Found on /.glusterfs/XX/XX/XXXXXXXXXXXX

Sunny Kumar sunkumar at redhat.com
Wed Mar 25 18:08:58 UTC 2020


Hi Senén,

By any chance did you perform any operation on the slave volume, like
deleting data directly from it?

Also, if possible, please share the geo-rep slave logs.
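
The slave-side logs normally live under
/var/log/glusterfs/geo-replication-slaves/ on the slave node; that matches the
slave_log_file and slave_gluster_log_file paths in the config you pasted. A
rough sketch for collecting them on samil (adjust paths if yours differ):

# on the slave node (samil)
ls -l /var/log/glusterfs/geo-replication-slaves/archivosvao_samil_archivossamil/
tar czf georep-slave-logs.tar.gz \
    /var/log/glusterfs/geo-replication-slaves/archivosvao_samil_archivossamil/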

/sunny

On Wed, Mar 25, 2020 at 9:15 AM Senén Vidal Blanco
<senenvidal at sgisoft.com> wrote:
>
> Hi,
> I have a problem with the Geo-Replication system.
> The first synchronization was successful a few days ago, but after running for
> a while I ran into an error message that prevents the sync from continuing.
> Here is a summary of the configuration:
>
> Debian 10
> Glusterfs 7.3
> Master volume: archivosvao
> Slave volume: archivossamil
>
> volume geo-replication archivosvao samil::archivossamil config
> access_mount:false
> allow_network:
> change_detector:changelog
> change_interval:5
> changelog_archive_format:%Y%m
> changelog_batch_size:727040
> changelog_log_file:/var/log/glusterfs/geo-replication/
> archivosvao_samil_archivossamil/changes-${local_id}.log
> changelog_log_level:INFO
> checkpoint:0
> cli_log_file:/var/log/glusterfs/geo-replication/cli.log
> cli_log_level:INFO
> connection_timeout:60
> georep_session_working_dir:/var/lib/glusterd/geo-replication/
> archivosvao_samil_archivossamil/
> gfid_conflict_resolution:true
> gluster_cli_options:
> gluster_command:gluster
> gluster_command_dir:/usr/sbin
> gluster_log_file:/var/log/glusterfs/geo-replication/
> archivosvao_samil_archivossamil/mnt-${local_id}.log
> gluster_log_level:INFO
> gluster_logdir:/var/log/glusterfs
> gluster_params:aux-gfid-mount acl
> gluster_rundir:/var/run/gluster
> glusterd_workdir:/var/lib/glusterd
> gsyncd_miscdir:/var/lib/misc/gluster/gsyncd
> ignore_deletes:false
> isolated_slaves:
> log_file:/var/log/glusterfs/geo-replication/archivosvao_samil_archivossamil/
> gsyncd.log
> log_level:INFO
> log_rsync_performance:false
> master_disperse_count:1
> master_distribution_count:1
> master_replica_count:1
> max_rsync_retries:10
> meta_volume_mnt:/var/run/gluster/shared_storage
> pid_file:/var/run/gluster/gsyncd-archivosvao-samil-archivossamil.pid
> remote_gsyncd:
> replica_failover_interval:1
> rsync_command:rsync
> rsync_opt_existing:true
> rsync_opt_ignore_missing_args:true
> rsync_options:
> rsync_ssh_options:
> slave_access_mount:false
> slave_gluster_command_dir:/usr/sbin
> slave_gluster_log_file:/var/log/glusterfs/geo-replication-slaves/
> archivosvao_samil_archivossamil/mnt-${master_node}-${master_brick_id}.log
> slave_gluster_log_file_mbr:/var/log/glusterfs/geo-replication-slaves/
> archivosvao_samil_archivossamil/mnt-mbr-${master_node}-${master_brick_id}.log
> slave_gluster_log_level:INFO
> slave_gluster_params:aux-gfid-mount acl
> slave_log_file:/var/log/glusterfs/geo-replication-slaves/
> archivosvao_samil_archivossamil/gsyncd.log
> slave_log_level:INFO
> slave_timeout:120
> special_sync_mode:
> ssh_command:ssh
> ssh_options:-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/
> lib/glusterd/geo-replication/secret.pem
> ssh_options_tar:-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /
> var/lib/glusterd/geo-replication/tar_ssh.pem
> ssh_port:22
> state_file:/var/lib/glusterd/geo-replication/archivosvao_samil_archivossamil/
> monitor.status
> state_socket_unencoded:
> stime_xattr_prefix:trusted.glusterfs.c7fa7778-
> f2e4-48f9-8817-5811c09964d5.8d4c7ef7-35fc-497a-9425-66f4aced159b
> sync_acls:true
> sync_jobs:3
> sync_method:rsync
> sync_xattrs:true
> tar_command:tar
> use_meta_volume:false
> use_rsync_xattrs:false
> working_dir:/var/lib/misc/gluster/gsyncd/archivosvao_samil_archivossamil/
>
>
> gluster> volume info
>
> Volume Name: archivossamil
> Type: Distribute
> Volume ID: 8d4c7ef7-35fc-497a-9425-66f4aced159b
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1
> Transport-type: tcp
> Bricks:
> Brick1: samil:/brickarchivos/archivos
> Options Reconfigured:
> nfs.disable: on
> storage.fips-mode-rchecksum: on
> transport.address-family: inet
> features.read-only: on
>
> Volume Name: archivosvao
> Type: Distribute
> Volume ID: c7fa7778-f2e4-48f9-8817-5811c09964d5
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1
> Transport-type: tcp
> Bricks:
> Brick1: vao:/brickarchivos/archivos
> Options Reconfigured:
> nfs.disable: on
> storage.fips-mode-rchecksum: on
> transport.address-family: inet
> geo-replication.indexing: on
> geo-replication.ignore-pid-check: on
> changelog.changelog: on
>
> Volume Name: home
> Type: Replicate
> Volume ID: 74522542-5d7a-4fdd-9cea-76bf1ff27e7d
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: samil:/brickhome/home
> Brick2: vao:/brickhome/home
> Options Reconfigured:
> performance.client-io-threads: off
> nfs.disable: on
> storage.fips-mode-rchecksum: on
> transport.address-family: inet
>
>
> These errors appear in the master logs:
>
>
>
> .............
>
> [2020-03-25 09:00:12.554226] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken    job=1   num_files=2     return_code=0
> duration=0.0483
> [2020-03-25 09:00:12.772688] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken    job=2   num_files=3     return_code=0
> duration=0.0539
> [2020-03-25 09:00:13.112986] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken    job=1   num_files=2     return_code=0
> duration=0.0575
> [2020-03-25 09:00:13.311976] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken    job=2   num_files=1     return_code=0
> duration=0.0379
> [2020-03-25 09:00:13.382845] I [master(worker /brickarchivos/archivos):
> 1227:process_change] _GMaster: Entry ops failed with gfid mismatch
> count=1
> [2020-03-25 09:00:13.385680] E [syncdutils(worker /brickarchivos/archivos):
> 339:log_raise_exception] <top>: FAIL:
> Traceback (most recent call last):
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line
> 332, in main func(args)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/subcmds.py",
> line 86, in subcmd_worker    local.service_loop(remote)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py",
> line 1297, in service_loop    g3.crawlwrap(oneshot=True)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 602, in crawlwrap    self.crawl()
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 1592, in crawl    self.changelogs_batch_process(changes)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 1492, in changelogs_batch_process    self.process(batch)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 1327, in process    self.process_change(change, done, retry)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 1230, in process_change    self.handle_entry_failures(failures, entries)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 973, in handle_entry_failures    failures1, retries, entry_ops1)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 936, in fix_possible_entry_failures    pargfid))
> FileNotFoundError: [Errno 2] No such file or directory: '/brickarchivos/
> archivos/.glusterfs/6e/eb/6eeb2c8f-da55-4066-995b-691290b69fdf'
> [2020-03-25 09:00:13.435045] I [repce(agent /brickarchivos/archivos):
> 96:service_loop] RepceServer: terminating on reaching EOF.
> [2020-03-25 09:00:14.248754] I [monitor(monitor):280:monitor] Monitor: worker
> died in startup phase     brick=/brickarchivos/archivos
> [2020-03-25 09:00:16.83872] I [gsyncdstatus(monitor):248:set_worker_status]
> GeorepStatus: Worker Status Change  status=Faulty
> [2020-03-25 09:00:36.304047] I [gsyncdstatus(monitor):248:set_worker_status]
> GeorepStatus: Worker Status Change status=Initializing...
> [2020-03-25 09:00:36.304274] I [monitor(monitor):159:monitor] Monitor:
> starting gsyncd worker   brick=/brickarchivos/archivos   slave_node=samil
> [2020-03-25 09:00:36.391111] I [gsyncd(agent /brickarchivos/archivos):
> 318:main] <top>: Using session config file        path=/var/lib/glusterd/geo-
> replication/archivosvao_samil_archivossamil/gsyncd.conf
> [2020-03-25 09:00:36.392865] I [changelogagent(agent /brickarchivos/archivos):
> 72:__init__] ChangelogAgent: Agent listining...
> [2020-03-25 09:00:36.399606] I [gsyncd(worker /brickarchivos/archivos):
> 318:main] <top>: Using session config file       path=/var/lib/glusterd/geo-
> replication/archivosvao_samil_archivossamil/gsyncd.conf
> [2020-03-25 09:00:36.412956] I [resource(worker /brickarchivos/archivos):
> 1386:connect_remote] SSH: Initializing SSH connection between master and
> slave...
> [2020-03-25 09:00:37.772666] I [resource(worker /brickarchivos/archivos):
> 1435:connect_remote] SSH: SSH connection between master and slave established.
> duration=1.3594
> [2020-03-25 09:00:37.773320] I [resource(worker /brickarchivos/archivos):
> 1105:connect] GLUSTER: Mounting gluster volume locally...
> [2020-03-25 09:00:38.821624] I [resource(worker /brickarchivos/archivos):
> 1128:connect] GLUSTER: Mounted gluster volume  duration=1.0479
> [2020-03-25 09:00:38.822003] I [subcmds(worker /brickarchivos/archivos):
> 84:subcmd_worker] <top>: Worker spawn successful. Acknowledging back to
> monitor
> [2020-03-25 09:00:41.797329] I [master(worker /brickarchivos/archivos):
> 1640:register] _GMaster: Working dir     path=/var/lib/misc/gluster/gsyncd/
> archivosvao_samil_archivossamil/brickarchivos-archivos
> [2020-03-25 09:00:41.798168] I [resource(worker /brickarchivos/archivos):
> 1291:service_loop] GLUSTER: Register time      time=1585126841
> [2020-03-25 09:00:42.143373] I [gsyncdstatus(worker /brickarchivos/archivos):
> 281:set_active] GeorepStatus: Worker Status Change status=Active
> [2020-03-25 09:00:42.310175] I [gsyncdstatus(worker /brickarchivos/archivos):
> 253:set_worker_crawl_status] GeorepStatus: Crawl Status Change
> status=History Crawl
> [2020-03-25 09:00:42.311381] I [master(worker /brickarchivos/archivos):
> 1554:crawl] _GMaster: starting history crawl     turns=1 stime=(1585015849, 0)
> etime=1585126842        entry_stime=(1585043575, 0)
> [2020-03-25 09:00:43.347883] I [master(worker /brickarchivos/archivos):
> 1583:crawl] _GMaster: slave's time       stime=(1585015849, 0)
> [2020-03-25 09:00:43.932979] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken    job=1   num_files=7     return_code=0
> duration=0.1022
> [2020-03-25 09:00:43.980473] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken    job=1   num_files=1     return_code=0
> duration=0.0467
> [2020-03-25 09:00:44.387296] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken    job=2   num_files=5     return_code=0
> duration=0.0539
> [2020-03-25 09:00:44.424803] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken    job=2   num_files=1     return_code=0
> duration=0.0368
> [2020-03-25 09:00:44.877503] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken    job=3   num_files=4     return_code=0
> duration=0.0431
> [2020-03-25 09:00:44.918785] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken    job=3   num_files=3     return_code=0
> duration=0.0403
> [2020-03-25 09:00:45.20351] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken     job=1   num_files=1     return_code=0
> duration=0.0382
> [2020-03-25 09:00:45.55611] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken     job=1   num_files=1     return_code=0
> duration=0.0344
> [2020-03-25 09:00:45.90699] I [master(worker /brickarchivos/archivos):
> 1991:syncjob] Syncer: Sync Time Taken     job=1   num_files=1     return_code=0
> duration=0.0341
>
> .............
>
>
> It seems that the source of the error is the absence of this file:
>
> FileNotFoundError: [Errno 2] No such file or directory: '/brickarchivos/
> archivos/.glusterfs/6e/eb/6eeb2c8f-da55-4066-995b-691290b69fdf'
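>
> For reference, a rough way to check whether that GFID still exists on the
> brick and which file it belongs to (needs root; the find/getfattr scan can be
> slow on a 2.5 TB brick):
>
> ls -l /brickarchivos/archivos/.glusterfs/6e/eb/6eeb2c8f-da55-4066-995b-691290b69fdf
> # search for the entry carrying this gfid xattr (hex value, dashes removed)
> find /brickarchivos/archivos -path '*/.glusterfs' -prune -o -print0 | \
>   xargs -0 getfattr -n trusted.gfid -e hex 2>/dev/null | \
>   grep -i -B1 0x6eeb2c8fda554066995b691290b69fdf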
>
>
> When the error appears, it tries to synchronize again and enters a retry loop.
>
> I stopped the sync and tried to resume it the next day. Now the error appears
> on a different file, but always within the .glusterfs index path of the brick:
>
>
> [2020-03-23 16:49:20.729115] I [master(worker /brickarchivos/archivos):
> 1227:process_change] _GMaster: Entry ops failed with gfid mismatch
> count=1
> [2020-03-23 16:49:20.731028] E [syncdutils(worker /brickarchivos/archivos):
> 339:log_raise_exception] <top>: FAIL:
> Traceback (most recent call last):
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line
> 332, in main    func(args)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/subcmds.py",
> line 86, in subcmd_worker    local.service_loop(remote)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py",
> line 1297, in service_loop    g3.crawlwrap(oneshot=True)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 602, in crawlwrap    self.crawl()
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 1592, in crawl    self.changelogs_batch_process(changes)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 1492, in changelogs_batch_process    self.process(batch)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 1327, in process    self.process_change(change, done, retry)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 1230, in process_change    self.handle_entry_failures(failures, entries)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 973, in handle_entry_failures    failures1, retries, entry_ops1)
>   File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line
> 936, in fix_possible_entry_failures
>     pargfid))
> FileNotFoundError: [Errno 2] No such file or directory: '/brickarchivos/
> archivos/.glusterfs/63/11/63113be6-0774-4719-96a6-619f7777aed2'
> [2020-03-23 16:49:20.764215] I [repce(agent /brickarchivos/archivos):
> 96:service_loop] RepceServer: terminating on reaching EOF.
>
> I have tried to remove the geo-replication session and recreate it, but the
> problem recurs.
> I did not delete the slave data, since it is more than 2.5 TB and it would
> take several days to synchronize again:
>
> volume geo-replication archivosvao samil::archivossamil stop
> volume geo-replication archivosvao samil::archivossamil delete
> volume set archivosvao geo-replication.indexing off
> volume geo-replication archivosvao samil::archivossamil create push-pem force
> volume geo-replication archivosvao samil::archivossamil start
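>
> (For completeness, the session state can then be checked with the standard
> commands, e.g.:)
>
> volume geo-replication archivosvao samil::archivossamil status detail
> volume geo-replication archivosvao samil::archivossamil config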
>
> But without success.
>
> Any help would be appreciated.
> Thank you.


