[Gluster-users] Geo-replication completely broken
Felix Kölzow
felix.koelzow at gmx.de
Fri Jul 3 08:16:30 UTC 2020
Dear Users,
the geo-replication is still broken. This is not a comfortable
situation.
Has any other user had the same experience, and can anyone share a
possible workaround?
We are currently running Gluster v6.0.
Regards,
Felix
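No confirmed fix yet, but one sequence worth trying (untested here) for
the "Changelog register failed ... [Errno 2]" loop in the logs below is
to stop the session, toggle changelog.changelog so the bricks get fresh
changelog/HTIME files, and start again. A dry-run sketch; the volume and
slave names are placeholders, and the run wrapper only prints the
commands instead of executing them:

```shell
#!/bin/sh
# Dry-run sketch of a possible geo-replication reset. MASTERVOL,
# SLAVEHOST and SLAVEVOL are placeholders for the real names.
MASTERVOL=mastervol
SLAVEHOST=slavehost
SLAVEVOL=slavevol

# Print each command instead of executing it; call the commands
# directly (without the wrapper) to apply the sequence for real.
run() { echo "+ $*"; }

# 1. Stop the faulty session.
run gluster volume geo-replication "$MASTERVOL" "$SLAVEHOST::$SLAVEVOL" stop

# 2. Toggle the changelog so fresh changelog/HTIME files are created.
run gluster volume set "$MASTERVOL" changelog.changelog off
run gluster volume set "$MASTERVOL" changelog.changelog on

# 3. Restart the session.
run gluster volume geo-replication "$MASTERVOL" "$SLAVEHOST::$SLAVEVOL" start
```

If the changelog history is gone, the restarted session should fall back
to a hybrid (xsync) crawl rather than a history crawl.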
On 25/06/2020 10:04, Shwetha Acharya wrote:
> Hi Rob and Felix,
>
> Please share the *-changes.log files and brick logs, which will help
> in analysis of the issue.
>
> Regards,
> Shwetha
>
> On Thu, Jun 25, 2020 at 1:26 PM Felix Kölzow <felix.koelzow at gmx.de> wrote:
>
> Hey Rob,
>
>
> same issue for our third volume. Have a look at the logs just from
> right now (below).
>
> Question: You removed the htime files and the old changelogs. Did you
> simply rm the files, or is there anything one should pay attention to
> before removing the changelog files and the htime file?
>
> Regards,
>
> Felix
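On the removal question above: a cautious approach is to stop the
geo-rep session first and take a backup of the brick's changelog
directory (which also holds the htime files) before deleting anything.
A sketch, assuming the usual <brick>/.glusterfs/changelogs layout; the
paths may differ on your version, so verify them on a brick first:

```shell
#!/bin/sh
# Back up a brick's changelog directory (changelogs + htime) to a
# tarball before removing anything, so the state can be restored.
# The .glusterfs/changelogs layout is an assumption; verify it first.
backup_changelogs() {
    brick=$1
    backup=$2
    cldir="$brick/.glusterfs/changelogs"
    if [ ! -d "$cldir" ]; then
        echo "no changelog dir at $cldir" >&2
        return 1
    fi
    tar czf "$backup" -C "$brick/.glusterfs" changelogs
    echo "backed up $cldir to $backup"
}

# Example (placeholder paths):
# backup_changelogs /gluster/vg00/dispersed_fuse1024/brick \
#     /root/changelogs-backup.tar.gz
```

With a backup in place, the CHANGELOG.* and htime files can be removed
(session stopped) and restored later if the situation gets worse.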
>
> [2020-06-25 07:51:53.795430] I [resource(worker
> /gluster/vg00/dispersed_fuse1024/brick):1435:connect_remote] SSH:
> SSH connection between master and slave established.
> duration=1.2341
> [2020-06-25 07:51:53.795639] I [resource(worker
> /gluster/vg00/dispersed_fuse1024/brick):1105:connect] GLUSTER:
> Mounting gluster volume locally...
> [2020-06-25 07:51:54.520601] I [monitor(monitor):280:monitor]
> Monitor: worker died in startup phase
> brick=/gluster/vg01/dispersed_fuse1024/brick
> [2020-06-25 07:51:54.535809] I
> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker
> Status Change status=Faulty
> [2020-06-25 07:51:54.882143] I [resource(worker
> /gluster/vg00/dispersed_fuse1024/brick):1128:connect] GLUSTER:
> Mounted gluster volume duration=1.0864
> [2020-06-25 07:51:54.882388] I [subcmds(worker
> /gluster/vg00/dispersed_fuse1024/brick):84:subcmd_worker] <top>:
> Worker spawn successful. Acknowledging back to monitor
> [2020-06-25 07:51:56.911412] E [repce(agent
> /gluster/vg00/dispersed_fuse1024/brick):121:worker] <top>: call
> failed:
> Traceback (most recent call last):
> File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line
> 117, in worker
> res = getattr(self.obj, rmeth)(*in_data[2:])
> File
> "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line
> 40, in register
> return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level,
> retries)
> File
> "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line
> 46, in cl_register
> cls.raise_changelog_err()
> File
> "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line
> 30, in raise_changelog_err
> raise ChangelogException(errn, os.strerror(errn))
> ChangelogException: [Errno 2] No such file or directory
> [2020-06-25 07:51:56.912056] E [repce(worker
> /gluster/vg00/dispersed_fuse1024/brick):213:__call__] RepceClient:
> call failed call=75086:140098349655872:1593071514.91
> method=register error=ChangelogException
> [2020-06-25 07:51:56.912396] E [resource(worker
> /gluster/vg00/dispersed_fuse1024/brick):1286:service_loop]
> GLUSTER: Changelog register failed error=[Errno 2] No such file
> or directory
> [2020-06-25 07:51:56.928031] I [repce(agent
> /gluster/vg00/dispersed_fuse1024/brick):96:service_loop]
> RepceServer: terminating on reaching EOF.
> [2020-06-25 07:51:57.886126] I [monitor(monitor):280:monitor]
> Monitor: worker died in startup phase
> brick=/gluster/vg00/dispersed_fuse1024/brick
> [2020-06-25 07:51:57.895920] I
> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker
> Status Change status=Faulty
> [2020-06-25 07:51:58.607405] I [gsyncdstatus(worker
> /gluster/vg00/dispersed_fuse1024/brick):287:set_passive]
> GeorepStatus: Worker Status Change status=Passive
> [2020-06-25 07:51:58.607768] I [gsyncdstatus(worker
> /gluster/vg01/dispersed_fuse1024/brick):287:set_passive]
> GeorepStatus: Worker Status Change status=Passive
> [2020-06-25 07:51:58.608004] I [gsyncdstatus(worker
> /gluster/vg00/dispersed_fuse1024/brick):281:set_active]
> GeorepStatus: Worker Status Change status=Active
>
>
> On 25/06/2020 09:15, Rob.Quagliozzi at rabobank.com wrote:
>>
>> Hi All,
>>
>> We've got two six-node RHEL 7.8 clusters, and geo-replication
>> appears to be completely broken between them. I've deleted the
>> session, removed & recreated the pem files and the old
>> changelogs/htime (after removing the relevant options from the
>> volume), and completely set up geo-rep again from scratch, but the
>> new session comes up as Initializing, then goes Faulty and starts
>> looping. The volume (on both sides) is a 4 x 2 disperse running
>> Gluster v6 (RH latest). Gsyncd reports:
>>
>> [2020-06-25 07:07:14.701423] I
>> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus:
>> Worker Status Change status=Initializing...
>>
>> [2020-06-25 07:07:14.701744] I [monitor(monitor):159:monitor]
>> Monitor: starting gsyncd worker brick=/rhgs/brick20/brick
>> slave_node=bxts470194.eu.rabonet.com
>>
>> [2020-06-25 07:07:14.707997] D [monitor(monitor):230:monitor]
>> Monitor: Worker would mount volume privately
>>
>> [2020-06-25 07:07:14.757181] I [gsyncd(agent
>> /rhgs/brick20/brick):318:main] <top>: Using session config file
>> path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
>>
>> [2020-06-25 07:07:14.758126] D [subcmds(agent
>> /rhgs/brick20/brick):107:subcmd_agent] <top>: RPC FD
>> rpc_fd='5,12,11,10'
>>
>> [2020-06-25 07:07:14.758627] I [changelogagent(agent
>> /rhgs/brick20/brick):72:__init__] ChangelogAgent: Agent listining...
>>
>> [2020-06-25 07:07:14.764234] I [gsyncd(worker
>> /rhgs/brick20/brick):318:main] <top>: Using session config file
>> path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
>>
>> [2020-06-25 07:07:14.779409] I [resource(worker
>> /rhgs/brick20/brick):1386:connect_remote] SSH: Initializing SSH
>> connection between master and slave...
>>
>> [2020-06-25 07:07:14.841793] D [repce(worker
>> /rhgs/brick20/brick):195:push] RepceClient: call
>> 6799:140380783982400:1593068834.84 __repce_version__() ...
>>
>> [2020-06-25 07:07:16.148725] D [repce(worker
>> /rhgs/brick20/brick):215:__call__] RepceClient: call
>> 6799:140380783982400:1593068834.84 __repce_version__ -> 1.0
>>
>> [2020-06-25 07:07:16.148911] D [repce(worker
>> /rhgs/brick20/brick):195:push] RepceClient: call
>> 6799:140380783982400:1593068836.15 version() ...
>>
>> [2020-06-25 07:07:16.149574] D [repce(worker
>> /rhgs/brick20/brick):215:__call__] RepceClient: call
>> 6799:140380783982400:1593068836.15 version -> 1.0
>>
>> [2020-06-25 07:07:16.149735] D [repce(worker
>> /rhgs/brick20/brick):195:push] RepceClient: call
>> 6799:140380783982400:1593068836.15 pid() ...
>>
>> [2020-06-25 07:07:16.150588] D [repce(worker
>> /rhgs/brick20/brick):215:__call__] RepceClient: call
>> 6799:140380783982400:1593068836.15 pid -> 30703
>>
>> [2020-06-25 07:07:16.150747] I [resource(worker
>> /rhgs/brick20/brick):1435:connect_remote] SSH: SSH connection
>> between master and slave established. duration=1.3712
>>
>> [2020-06-25 07:07:16.150819] I [resource(worker
>> /rhgs/brick20/brick):1105:connect] GLUSTER: Mounting gluster
>> volume locally...
>>
>> [2020-06-25 07:07:16.265860] D [resource(worker
>> /rhgs/brick20/brick):879:inhibit] DirectMounter: auxiliary
>> glusterfs mount in place
>>
>> [2020-06-25 07:07:17.272511] D [resource(worker
>> /rhgs/brick20/brick):953:inhibit] DirectMounter: auxiliary
>> glusterfs mount prepared
>>
>> [2020-06-25 07:07:17.272708] I [resource(worker
>> /rhgs/brick20/brick):1128:connect] GLUSTER: Mounted gluster
>> volume duration=1.1218
>>
>> [2020-06-25 07:07:17.272794] I [subcmds(worker
>> /rhgs/brick20/brick):84:subcmd_worker] <top>: Worker spawn
>> successful. Acknowledging back to monitor
>>
>> [2020-06-25 07:07:17.272973] D [master(worker
>> /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up
>> change detection mode mode=xsync
>>
>> [2020-06-25 07:07:17.273063] D [monitor(monitor):273:monitor]
>> Monitor: worker(/rhgs/brick20/brick) connected
>>
>> [2020-06-25 07:07:17.273678] D [master(worker
>> /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up
>> change detection mode mode=changelog
>>
>> [2020-06-25 07:07:17.274224] D [master(worker
>> /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up
>> change detection mode mode=changeloghistory
>>
>> [2020-06-25 07:07:17.276484] D [repce(worker
>> /rhgs/brick20/brick):195:push] RepceClient: call
>> 6799:140380783982400:1593068837.28 version() ...
>>
>> [2020-06-25 07:07:17.276916] D [repce(worker
>> /rhgs/brick20/brick):215:__call__] RepceClient: call
>> 6799:140380783982400:1593068837.28 version -> 1.0
>>
>> [2020-06-25 07:07:17.277009] D [master(worker
>> /rhgs/brick20/brick):777:setup_working_dir] _GMaster: changelog
>> working dir
>> /var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick
>>
>> [2020-06-25 07:07:17.277098] D [repce(worker
>> /rhgs/brick20/brick):195:push] RepceClient: call
>> 6799:140380783982400:1593068837.28 init() ...
>>
>> [2020-06-25 07:07:17.292944] D [repce(worker
>> /rhgs/brick20/brick):215:__call__] RepceClient: call
>> 6799:140380783982400:1593068837.28 init -> None
>>
>> [2020-06-25 07:07:17.293097] D [repce(worker
>> /rhgs/brick20/brick):195:push] RepceClient: call
>> 6799:140380783982400:1593068837.29
>> register('/rhgs/brick20/brick',
>> '/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick',
>> '/var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-rhgs-brick20-brick.log',
>> 8, 5) ...
>>
>> [2020-06-25 07:07:19.296294] E [repce(agent
>> /rhgs/brick20/brick):121:worker] <top>: call failed:
>>
>> Traceback (most recent call last):
>>
>> File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line
>> 117, in worker
>>
>> res = getattr(self.obj, rmeth)(*in_data[2:])
>>
>> File
>> "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py",
>> line 40, in register
>>
>> return Changes.cl_register(cl_brick, cl_dir, cl_log,
>> cl_level, retries)
>>
>> File
>> "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py",
>> line 46, in cl_register
>>
>> cls.raise_changelog_err()
>>
>> File
>> "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py",
>> line 30, in raise_changelog_err
>>
>> raise ChangelogException(errn, os.strerror(errn))
>>
>> ChangelogException: [Errno 2] No such file or directory
>>
>> [2020-06-25 07:07:19.297161] E [repce(worker
>> /rhgs/brick20/brick):213:__call__] RepceClient: call failed
>> call=6799:140380783982400:1593068837.29 method=register
>> error=ChangelogException
>>
>> [2020-06-25 07:07:19.297338] E [resource(worker
>> /rhgs/brick20/brick):1286:service_loop] GLUSTER: Changelog
>> register failed error=[Errno 2] No such file or directory
>>
>> [2020-06-25 07:07:19.315074] I [repce(agent
>> /rhgs/brick20/brick):96:service_loop] RepceServer: terminating on
>> reaching EOF.
>>
>> [2020-06-25 07:07:20.275701] I [monitor(monitor):280:monitor]
>> Monitor: worker died in startup phase brick=/rhgs/brick20/brick
>>
>> [2020-06-25 07:07:20.277383] I
>> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus:
>> Worker Status Change status=Faulty
>>
>> We've done everything we can think of, including an "strace -f" on
>> the pid, and we can't really find anything. I'm about to lose the
>> last of my hair over this, so does anyone have any ideas at all?
>> We've even removed the entire slave volume and rebuilt it.
>>
>> Thanks
>>
>> Rob
>>
>> *Rob Quagliozzi*
>>
>> *Specialised Application Support*
>>
>>
>>
>>
>>
>> ________
>>
>>
>>
>> Community Meeting Calendar:
>>
>> Schedule -
>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> Bridge:https://bluejeans.com/441850968
>>
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users