[Gluster-users] Geo-replication completely broken
Felix Kölzow
felix.koelzow at gmx.de
Fri Jul 3 08:16:30 UTC 2020
Dear Users,
the geo-replication is still broken. This is not a comfortable
situation.
Has any other user had the same experience, and can anyone share a
possible workaround?
We are currently running Gluster v6.0.
Regards,
Felix
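No confirmed fix yet, but one sequence worth trying (untested here) for
the "Changelog register failed ... [Errno 2]" loop in the logs below is
to stop the session, toggle changelog.changelog so the bricks get fresh
changelog/HTIME files, and start again. A dry-run sketch; the volume and
slave names are placeholders, and the run wrapper only prints the
commands instead of executing them:

```shell
#!/bin/sh
# Dry-run sketch of a possible geo-replication reset. MASTERVOL,
# SLAVEHOST and SLAVEVOL are placeholders for the real names.
MASTERVOL=mastervol
SLAVEHOST=slavehost
SLAVEVOL=slavevol

# Print each command instead of executing it; call the commands
# directly (without the wrapper) to apply the sequence for real.
run() { echo "+ $*"; }

# 1. Stop the faulty session.
run gluster volume geo-replication "$MASTERVOL" "$SLAVEHOST::$SLAVEVOL" stop

# 2. Toggle the changelog so fresh changelog/HTIME files are created.
run gluster volume set "$MASTERVOL" changelog.changelog off
run gluster volume set "$MASTERVOL" changelog.changelog on

# 3. Restart the session.
run gluster volume geo-replication "$MASTERVOL" "$SLAVEHOST::$SLAVEVOL" start
```

If the changelog history is gone, the restarted session should fall back
to a hybrid (xsync) crawl rather than a history crawl.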
On 25/06/2020 10:04, Shwetha Acharya wrote:
> Hi Rob and Felix,
>
> Please share the *-changes.log files and brick logs, which will help
> in analysis of the issue.
>
> Regards,
> Shwetha
>
> On Thu, Jun 25, 2020 at 1:26 PM Felix Kölzow <felix.koelzow at gmx.de> wrote:
>
> Hey Rob,
>
>
> same issue for our third volume. Have a look at the logs just from
> right now (below).
>
> Question: You removed the htime files and the old changelogs. Did you
> simply rm the files, or is there anything one should pay attention to
> before removing the changelog files and the htime file?
>
> Regards,
>
> Felix
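On the removal question above: a cautious approach is to stop the
geo-rep session first and take a backup of the brick's changelog
directory (which also holds the htime files) before deleting anything.
A sketch, assuming the usual <brick>/.glusterfs/changelogs layout; the
paths may differ on your version, so verify them on a brick first:

```shell
#!/bin/sh
# Back up a brick's changelog directory (changelogs + htime) to a
# tarball before removing anything, so the state can be restored.
# The .glusterfs/changelogs layout is an assumption; verify it first.
backup_changelogs() {
    brick=$1
    backup=$2
    cldir="$brick/.glusterfs/changelogs"
    if [ ! -d "$cldir" ]; then
        echo "no changelog dir at $cldir" >&2
        return 1
    fi
    tar czf "$backup" -C "$brick/.glusterfs" changelogs
    echo "backed up $cldir to $backup"
}

# Example (placeholder paths):
# backup_changelogs /gluster/vg00/dispersed_fuse1024/brick \
#     /root/changelogs-backup.tar.gz
```

With a backup in place, the CHANGELOG.* and htime files can be removed
(session stopped) and restored later if the situation gets worse.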
>
> [2020-06-25 07:51:53.795430] I [resource(worker
> /gluster/vg00/dispersed_fuse1024/brick):1435:connect_remote] SSH:
> SSH connection between master and slave established.
> duration=1.2341
> [2020-06-25 07:51:53.795639] I [resource(worker
> /gluster/vg00/dispersed_fuse1024/brick):1105:connect] GLUSTER:
> Mounting gluster volume locally...
> [2020-06-25 07:51:54.520601] I [monitor(monitor):280:monitor]
> Monitor: worker died in startup phase
> brick=/gluster/vg01/dispersed_fuse1024/brick
> [2020-06-25 07:51:54.535809] I
> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker
> Status Change status=Faulty
> [2020-06-25 07:51:54.882143] I [resource(worker
> /gluster/vg00/dispersed_fuse1024/brick):1128:connect] GLUSTER:
> Mounted gluster volume duration=1.0864
> [2020-06-25 07:51:54.882388] I [subcmds(worker
> /gluster/vg00/dispersed_fuse1024/brick):84:subcmd_worker] <top>:
> Worker spawn successful. Acknowledging back to monitor
> [2020-06-25 07:51:56.911412] E [repce(agent
> /gluster/vg00/dispersed_fuse1024/brick):121:worker] <top>: call
> failed:
> Traceback (most recent call last):
> File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line
> 117, in worker
> res = getattr(self.obj, rmeth)(*in_data[2:])
> File
> "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line
> 40, in register
> return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level,
> retries)
> File
> "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line
> 46, in cl_register
> cls.raise_changelog_err()
> File
> "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line
> 30, in raise_changelog_err
> raise ChangelogException(errn, os.strerror(errn))
> ChangelogException: [Errno 2] No such file or directory
> [2020-06-25 07:51:56.912056] E [repce(worker
> /gluster/vg00/dispersed_fuse1024/brick):213:__call__] RepceClient:
> call failed call=75086:140098349655872:1593071514.91
> method=register error=ChangelogException
> [2020-06-25 07:51:56.912396] E [resource(worker
> /gluster/vg00/dispersed_fuse1024/brick):1286:service_loop]
> GLUSTER: Changelog register failed error=[Errno 2] No such file
> or directory
> [2020-06-25 07:51:56.928031] I [repce(agent
> /gluster/vg00/dispersed_fuse1024/brick):96:service_loop]
> RepceServer: terminating on reaching EOF.
> [2020-06-25 07:51:57.886126] I [monitor(monitor):280:monitor]
> Monitor: worker died in startup phase
> brick=/gluster/vg00/dispersed_fuse1024/brick
> [2020-06-25 07:51:57.895920] I
> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker
> Status Change status=Faulty
> [2020-06-25 07:51:58.607405] I [gsyncdstatus(worker
> /gluster/vg00/dispersed_fuse1024/brick):287:set_passive]
> GeorepStatus: Worker Status Change status=Passive
> [2020-06-25 07:51:58.607768] I [gsyncdstatus(worker
> /gluster/vg01/dispersed_fuse1024/brick):287:set_passive]
> GeorepStatus: Worker Status Change status=Passive
> [2020-06-25 07:51:58.608004] I [gsyncdstatus(worker
> /gluster/vg00/dispersed_fuse1024/brick):281:set_active]
> GeorepStatus: Worker Status Change status=Active
>
>
> On 25/06/2020 09:15, Rob.Quagliozzi at rabobank.com wrote:
>>
>> Hi All,
>>
>> We've got two six-node RHEL 7.8 clusters, and geo-replication
>> appears to be completely broken between them. I've deleted the
>> session, removed & recreated the pem files and the old
>> changelogs/htime (after removing the relevant options from the
>> volume), and completely set up geo-rep again from scratch, but the
>> new session comes up as Initializing, then goes Faulty and starts
>> looping. The volume (on both sides) is a 4 x 2 disperse running
>> Gluster v6 (RH latest). Gsyncd reports:
>>
>> [2020-06-25 07:07:14.701423] I
>> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus:
>> Worker Status Change status=Initializing...
>>
>> [2020-06-25 07:07:14.701744] I [monitor(monitor):159:monitor]
>> Monitor: starting gsyncd worker brick=/rhgs/brick20/brick
>> slave_node=bxts470194.eu.rabonet.com
>>
>> [2020-06-25 07:07:14.707997] D [monitor(monitor):230:monitor]
>> Monitor: Worker would mount volume privately
>>
>> [2020-06-25 07:07:14.757181] I [gsyncd(agent
>> /rhgs/brick20/brick):318:main] <top>: Using session config file
>> path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
>>
>> [2020-06-25 07:07:14.758126] D [subcmds(agent
>> /rhgs/brick20/brick):107:subcmd_agent] <top>: RPC FD
>> rpc_fd='5,12,11,10'
>>
>> [2020-06-25 07:07:14.758627] I [changelogagent(agent
>> /rhgs/brick20/brick):72:__init__] ChangelogAgent: Agent listining...
>>
>> [2020-06-25 07:07:14.764234] I [gsyncd(worker
>> /rhgs/brick20/brick):318:main] <top>: Using session config file
>> path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
>>
>> [2020-06-25 07:07:14.779409] I [resource(worker
>> /rhgs/brick20/brick):1386:connect_remote] SSH: Initializing SSH
>> connection between master and slave...
>>
>> [2020-06-25 07:07:14.841793] D [repce(worker
>> /rhgs/brick20/brick):195:push] RepceClient: call
>> 6799:140380783982400:1593068834.84 __repce_version__() ...
>>
>> [2020-06-25 07:07:16.148725] D [repce(worker
>> /rhgs/brick20/brick):215:__call__] RepceClient: call
>> 6799:140380783982400:1593068834.84 __repce_version__ -> 1.0
>>
>> [2020-06-25 07:07:16.148911] D [repce(worker
>> /rhgs/brick20/brick):195:push] RepceClient: call
>> 6799:140380783982400:1593068836.15 version() ...
>>
>> [2020-06-25 07:07:16.149574] D [repce(worker
>> /rhgs/brick20/brick):215:__call__] RepceClient: call
>> 6799:140380783982400:1593068836.15 version -> 1.0
>>
>> [2020-06-25 07:07:16.149735] D [repce(worker
>> /rhgs/brick20/brick):195:push] RepceClient: call
>> 6799:140380783982400:1593068836.15 pid() ...
>>
>> [2020-06-25 07:07:16.150588] D [repce(worker
>> /rhgs/brick20/brick):215:__call__] RepceClient: call
>> 6799:140380783982400:1593068836.15 pid -> 30703
>>
>> [2020-06-25 07:07:16.150747] I [resource(worker
>> /rhgs/brick20/brick):1435:connect_remote] SSH: SSH connection
>> between master and slave established. duration=1.3712
>>
>> [2020-06-25 07:07:16.150819] I [resource(worker
>> /rhgs/brick20/brick):1105:connect] GLUSTER: Mounting gluster
>> volume locally...
>>
>> [2020-06-25 07:07:16.265860] D [resource(worker
>> /rhgs/brick20/brick):879:inhibit] DirectMounter: auxiliary
>> glusterfs mount in place
>>
>> [2020-06-25 07:07:17.272511] D [resource(worker
>> /rhgs/brick20/brick):953:inhibit] DirectMounter: auxiliary
>> glusterfs mount prepared
>>
>> [2020-06-25 07:07:17.272708] I [resource(worker
>> /rhgs/brick20/brick):1128:connect] GLUSTER: Mounted gluster
>> volume duration=1.1218
>>
>> [2020-06-25 07:07:17.272794] I [subcmds(worker
>> /rhgs/brick20/brick):84:subcmd_worker] <top>: Worker spawn
>> successful. Acknowledging back to monitor
>>
>> [2020-06-25 07:07:17.272973] D [master(worker
>> /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up
>> change detection mode mode=xsync
>>
>> [2020-06-25 07:07:17.273063] D [monitor(monitor):273:monitor]
>> Monitor: worker(/rhgs/brick20/brick) connected
>>
>> [2020-06-25 07:07:17.273678] D [master(worker
>> /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up
>> change detection mode mode=changelog
>>
>> [2020-06-25 07:07:17.274224] D [master(worker
>> /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up
>> change detection mode mode=changeloghistory
>>
>> [2020-06-25 07:07:17.276484] D [repce(worker
>> /rhgs/brick20/brick):195:push] RepceClient: call
>> 6799:140380783982400:1593068837.28 version() ...
>>
>> [2020-06-25 07:07:17.276916] D [repce(worker
>> /rhgs/brick20/brick):215:__call__] RepceClient: call
>> 6799:140380783982400:1593068837.28 version -> 1.0
>>
>> [2020-06-25 07:07:17.277009] D [master(worker
>> /rhgs/brick20/brick):777:setup_working_dir] _GMaster: changelog
>> working dir
>> /var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick
>>
>> [2020-06-25 07:07:17.277098] D [repce(worker
>> /rhgs/brick20/brick):195:push] RepceClient: call
>> 6799:140380783982400:1593068837.28 init() ...
>>
>> [2020-06-25 07:07:17.292944] D [repce(worker
>> /rhgs/brick20/brick):215:__call__] RepceClient: call
>> 6799:140380783982400:1593068837.28 init -> None
>>
>> [2020-06-25 07:07:17.293097] D [repce(worker
>> /rhgs/brick20/brick):195:push] RepceClient: call
>> 6799:140380783982400:1593068837.29
>> register('/rhgs/brick20/brick',
>> '/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick',
>> '/var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-rhgs-brick20-brick.log',
>> 8, 5) ...
>>
>> [2020-06-25 07:07:19.296294] E [repce(agent
>> /rhgs/brick20/brick):121:worker] <top>: call failed:
>>
>> Traceback (most recent call last):
>>
>> File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line
>> 117, in worker
>>
>> res = getattr(self.obj, rmeth)(*in_data[2:])
>>
>> File
>> "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py",
>> line 40, in register
>>
>> return Changes.cl_register(cl_brick, cl_dir, cl_log,
>> cl_level, retries)
>>
>> File
>> "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py",
>> line 46, in cl_register
>>
>> cls.raise_changelog_err()
>>
>> File
>> "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py",
>> line 30, in raise_changelog_err
>>
>> raise ChangelogException(errn, os.strerror(errn))
>>
>> ChangelogException: [Errno 2] No such file or directory
>>
>> [2020-06-25 07:07:19.297161] E [repce(worker
>> /rhgs/brick20/brick):213:__call__] RepceClient: call failed
>> call=6799:140380783982400:1593068837.29 method=register
>> error=ChangelogException
>>
>> [2020-06-25 07:07:19.297338] E [resource(worker
>> /rhgs/brick20/brick):1286:service_loop] GLUSTER: Changelog
>> register failed error=[Errno 2] No such file or directory
>>
>> [2020-06-25 07:07:19.315074] I [repce(agent
>> /rhgs/brick20/brick):96:service_loop] RepceServer: terminating on
>> reaching EOF.
>>
>> [2020-06-25 07:07:20.275701] I [monitor(monitor):280:monitor]
>> Monitor: worker died in startup phase brick=/rhgs/brick20/brick
>>
>> [2020-06-25 07:07:20.277383] I
>> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus:
>> Worker Status Change status=Faulty
>>
>> We've done everything we can think of, including an "strace -f" on
>> the pid, and we can't really find anything. I'm about to lose the
>> last of my hair over this, so does anyone have any ideas at all?
>> We've even removed the entire slave volume and rebuilt it.
>>
>> Thanks
>>
>> Rob
>>
>> *Rob Quagliozzi*
>>
>> *Specialised Application Support*
>>
>>
>>
>>
>>
>> ________
>>
>>
>>
>> Community Meeting Calendar:
>>
>> Schedule -
>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> Bridge:https://bluejeans.com/441850968
>>
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users