[Gluster-users] Geo-replication completely broken

Shwetha Acharya sacharya at redhat.com
Thu Jun 25 08:04:23 UTC 2020


Hi Rob and Felix,

Please share the *-changes.log files and the brick logs, which will help in
analysing the issue.
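
In case it helps, here is a rough sketch of where those logs normally live. The session and brick names below are the ones visible in Rob's trace further down (the register() call shows the exact changes-log path); substitute your own:

```shell
# Names taken from Rob's trace below -- adjust to your session.
MASTERVOL=prd_mx_intvol
SLAVEHOST=bxts470190
SLAVEVOL=prd_mx_intvol
SESSION="${MASTERVOL}_${SLAVEHOST}_${SLAVEVOL}"
BRICK=/rhgs/brick20/brick

# Per-brick changes log: the brick path with '/' mapped to '-'
# (leading '-' dropped), matching the register() call in the traceback.
CHANGES_LOG="/var/log/glusterfs/geo-replication/${SESSION}/changes-$(printf '%s' "$BRICK" | tr '/' '-' | cut -c2-).log"
echo "$CHANGES_LOG"

# Brick logs live under /var/log/glusterfs/bricks/ on each brick node:
ls /var/log/glusterfs/bricks/*.log 2>/dev/null || true
```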

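Felix, regarding your question about removing the htime and changelog files: the usual precaution is to stop the geo-rep session and disable the changelog option first, so nothing is writing to the htime index while you move it aside, and to move the files rather than rm them so you can recover. A sketch (not verified against your setup; volume and brick names are the ones from Rob's logs), written as a dry run that only prints each command:

```shell
# Dry-run sketch: prints each command instead of executing it.
# Drop the run() wrapper to execute for real; adjust names to your session.
MASTERVOL=prd_mx_intvol
SLAVE=bxts470190::prd_mx_intvol
BRICK=/rhgs/brick20/brick

run() { echo "$@"; }

run gluster volume geo-replication "$MASTERVOL" "$SLAVE" stop
run gluster volume set "$MASTERVOL" changelog.changelog off
# On EVERY brick node: move the changelogs (including htime/) aside, don't rm.
run mv "$BRICK/.glusterfs/changelogs" "$BRICK/.glusterfs/changelogs.bak"
run gluster volume set "$MASTERVOL" changelog.changelog on
run gluster volume geo-replication "$MASTERVOL" "$SLAVE" start
```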
Regards,
Shwetha

On Thu, Jun 25, 2020 at 1:26 PM Felix Kölzow <felix.koelzow at gmx.de> wrote:

> Hey Rob,
>
>
> We are seeing the same issue on our third volume. Have a look at the logs
> from just now (below).
>
> Question: you removed the htime files and the old changelogs. Did you simply
> rm the files, or is there anything to pay extra attention to before removing
> the changelog files and the htime file?
>
> Regards,
>
> Felix
>
> [2020-06-25 07:51:53.795430] I [resource(worker
> /gluster/vg00/dispersed_fuse1024/brick):1435:connect_remote] SSH: SSH
> connection between master and slave established.    duration=1.2341
> [2020-06-25 07:51:53.795639] I [resource(worker
> /gluster/vg00/dispersed_fuse1024/brick):1105:connect] GLUSTER: Mounting
> gluster volume locally...
> [2020-06-25 07:51:54.520601] I [monitor(monitor):280:monitor] Monitor:
> worker died in startup phase    brick=/gluster/vg01/dispersed_fuse1024/brick
> [2020-06-25 07:51:54.535809] I
> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status
> Change    status=Faulty
> [2020-06-25 07:51:54.882143] I [resource(worker
> /gluster/vg00/dispersed_fuse1024/brick):1128:connect] GLUSTER: Mounted
> gluster volume    duration=1.0864
> [2020-06-25 07:51:54.882388] I [subcmds(worker
> /gluster/vg00/dispersed_fuse1024/brick):84:subcmd_worker] <top>: Worker
> spawn successful. Acknowledging back to monitor
> [2020-06-25 07:51:56.911412] E [repce(agent
> /gluster/vg00/dispersed_fuse1024/brick):121:worker] <top>: call failed:
> Traceback (most recent call last):
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117, in
> worker
>     res = getattr(self.obj, rmeth)(*in_data[2:])
>   File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line
> 40, in register
>     return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
>   File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line
> 46, in cl_register
>     cls.raise_changelog_err()
>   File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line
> 30, in raise_changelog_err
>     raise ChangelogException(errn, os.strerror(errn))
> ChangelogException: [Errno 2] No such file or directory
> [2020-06-25 07:51:56.912056] E [repce(worker
> /gluster/vg00/dispersed_fuse1024/brick):213:__call__] RepceClient: call
> failed    call=75086:140098349655872:1593071514.91    method=register
> error=ChangelogException
> [2020-06-25 07:51:56.912396] E [resource(worker
> /gluster/vg00/dispersed_fuse1024/brick):1286:service_loop] GLUSTER:
> Changelog register failed    error=[Errno 2] No such file or directory
> [2020-06-25 07:51:56.928031] I [repce(agent
> /gluster/vg00/dispersed_fuse1024/brick):96:service_loop] RepceServer:
> terminating on reaching EOF.
> [2020-06-25 07:51:57.886126] I [monitor(monitor):280:monitor] Monitor:
> worker died in startup phase    brick=/gluster/vg00/dispersed_fuse1024/brick
> [2020-06-25 07:51:57.895920] I
> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status
> Change    status=Faulty
> [2020-06-25 07:51:58.607405] I [gsyncdstatus(worker
> /gluster/vg00/dispersed_fuse1024/brick):287:set_passive] GeorepStatus:
> Worker Status Change    status=Passive
> [2020-06-25 07:51:58.607768] I [gsyncdstatus(worker
> /gluster/vg01/dispersed_fuse1024/brick):287:set_passive] GeorepStatus:
> Worker Status Change    status=Passive
> [2020-06-25 07:51:58.608004] I [gsyncdstatus(worker
> /gluster/vg00/dispersed_fuse1024/brick):281:set_active] GeorepStatus:
> Worker Status Change    status=Active
>
>
> On 25/06/2020 09:15, Rob.Quagliozzi at rabobank.com wrote:
>
> Hi All,
>
>
>
> We’ve got two six-node RHEL 7.8 clusters, and geo-replication appears to be
> completely broken between them. I’ve deleted the session, removed and
> recreated the pem files, removed the old changelogs/htime (after removing the
> relevant options from the volume) and set up geo-rep again from scratch, but
> the new session comes up as Initializing, then goes Faulty and starts
> looping. The volume (on both sides) is a 4 x 2 disperse, running Gluster v6
> (latest RH). Gsyncd reports:
>
>
>
> [2020-06-25 07:07:14.701423] I
> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status
> Change status=Initializing...
>
> [2020-06-25 07:07:14.701744] I [monitor(monitor):159:monitor] Monitor:
> starting gsyncd worker   brick=/rhgs/brick20/brick       slave_node=
> bxts470194.eu.rabonet.com
>
> [2020-06-25 07:07:14.707997] D [monitor(monitor):230:monitor] Monitor:
> Worker would mount volume privately
>
> [2020-06-25 07:07:14.757181] I [gsyncd(agent
> /rhgs/brick20/brick):318:main] <top>: Using session config file
> path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
>
> [2020-06-25 07:07:14.758126] D [subcmds(agent
> /rhgs/brick20/brick):107:subcmd_agent] <top>: RPC FD
> rpc_fd='5,12,11,10'
>
> [2020-06-25 07:07:14.758627] I [changelogagent(agent
> /rhgs/brick20/brick):72:__init__] ChangelogAgent: Agent listining...
>
> [2020-06-25 07:07:14.764234] I [gsyncd(worker
> /rhgs/brick20/brick):318:main] <top>: Using session config file
> path=/var/lib/glusterd/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/gsyncd.conf
>
> [2020-06-25 07:07:14.779409] I [resource(worker
> /rhgs/brick20/brick):1386:connect_remote] SSH: Initializing SSH connection
> between master and slave...
>
> [2020-06-25 07:07:14.841793] D [repce(worker
> /rhgs/brick20/brick):195:push] RepceClient: call
> 6799:140380783982400:1593068834.84 __repce_version__() ...
>
> [2020-06-25 07:07:16.148725] D [repce(worker
> /rhgs/brick20/brick):215:__call__] RepceClient: call
> 6799:140380783982400:1593068834.84 __repce_version__ -> 1.0
>
> [2020-06-25 07:07:16.148911] D [repce(worker
> /rhgs/brick20/brick):195:push] RepceClient: call
> 6799:140380783982400:1593068836.15 version() ...
>
> [2020-06-25 07:07:16.149574] D [repce(worker
> /rhgs/brick20/brick):215:__call__] RepceClient: call
> 6799:140380783982400:1593068836.15 version -> 1.0
>
> [2020-06-25 07:07:16.149735] D [repce(worker
> /rhgs/brick20/brick):195:push] RepceClient: call
> 6799:140380783982400:1593068836.15 pid() ...
>
> [2020-06-25 07:07:16.150588] D [repce(worker
> /rhgs/brick20/brick):215:__call__] RepceClient: call
> 6799:140380783982400:1593068836.15 pid -> 30703
>
> [2020-06-25 07:07:16.150747] I [resource(worker
> /rhgs/brick20/brick):1435:connect_remote] SSH: SSH connection between
> master and slave established.     duration=1.3712
>
> [2020-06-25 07:07:16.150819] I [resource(worker
> /rhgs/brick20/brick):1105:connect] GLUSTER: Mounting gluster volume
> locally...
>
> [2020-06-25 07:07:16.265860] D [resource(worker
> /rhgs/brick20/brick):879:inhibit] DirectMounter: auxiliary glusterfs mount
> in place
>
> [2020-06-25 07:07:17.272511] D [resource(worker
> /rhgs/brick20/brick):953:inhibit] DirectMounter: auxiliary glusterfs mount
> prepared
>
> [2020-06-25 07:07:17.272708] I [resource(worker
> /rhgs/brick20/brick):1128:connect] GLUSTER: Mounted gluster volume
> duration=1.1218
>
> [2020-06-25 07:07:17.272794] I [subcmds(worker
> /rhgs/brick20/brick):84:subcmd_worker] <top>: Worker spawn successful.
> Acknowledging back to monitor
>
> [2020-06-25 07:07:17.272973] D [master(worker
> /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change
> detection mode mode=xsync
>
> [2020-06-25 07:07:17.273063] D [monitor(monitor):273:monitor] Monitor:
> worker(/rhgs/brick20/brick) connected
>
> [2020-06-25 07:07:17.273678] D [master(worker
> /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change
> detection mode mode=changelog
>
> [2020-06-25 07:07:17.274224] D [master(worker
> /rhgs/brick20/brick):104:gmaster_builder] <top>: setting up change
> detection mode mode=changeloghistory
>
> [2020-06-25 07:07:17.276484] D [repce(worker
> /rhgs/brick20/brick):195:push] RepceClient: call
> 6799:140380783982400:1593068837.28 version() ...
>
> [2020-06-25 07:07:17.276916] D [repce(worker
> /rhgs/brick20/brick):215:__call__] RepceClient: call
> 6799:140380783982400:1593068837.28 version -> 1.0
>
> [2020-06-25 07:07:17.277009] D [master(worker
> /rhgs/brick20/brick):777:setup_working_dir] _GMaster: changelog working dir
> /var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick
>
> [2020-06-25 07:07:17.277098] D [repce(worker
> /rhgs/brick20/brick):195:push] RepceClient: call
> 6799:140380783982400:1593068837.28 init() ...
>
> [2020-06-25 07:07:17.292944] D [repce(worker
> /rhgs/brick20/brick):215:__call__] RepceClient: call
> 6799:140380783982400:1593068837.28 init -> None
>
> [2020-06-25 07:07:17.293097] D [repce(worker
> /rhgs/brick20/brick):195:push] RepceClient: call
> 6799:140380783982400:1593068837.29 register('/rhgs/brick20/brick',
> '/var/lib/misc/gluster/gsyncd/prd_mx_intvol_bxts470190_prd_mx_intvol/rhgs-brick20-brick',
> '/var/log/glusterfs/geo-replication/prd_mx_intvol_bxts470190_prd_mx_intvol/changes-rhgs-brick20-brick.log',
> 8, 5) ...
>
> [2020-06-25 07:07:19.296294] E [repce(agent
> /rhgs/brick20/brick):121:worker] <top>: call failed:
>
> Traceback (most recent call last):
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 117, in
> worker
>
>     res = getattr(self.obj, rmeth)(*in_data[2:])
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line
> 40, in register
>
>     return Changes.cl_register(cl_brick, cl_dir, cl_log, cl_level, retries)
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line
> 46, in cl_register
>
>     cls.raise_changelog_err()
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line
> 30, in raise_changelog_err
>
>     raise ChangelogException(errn, os.strerror(errn))
>
> ChangelogException: [Errno 2] No such file or directory
>
> [2020-06-25 07:07:19.297161] E [repce(worker
> /rhgs/brick20/brick):213:__call__] RepceClient: call failed
> call=6799:140380783982400:1593068837.29 method=register
> error=ChangelogException
>
> [2020-06-25 07:07:19.297338] E [resource(worker
> /rhgs/brick20/brick):1286:service_loop] GLUSTER: Changelog register
> failed      error=[Errno 2] No such file or directory
>
> [2020-06-25 07:07:19.315074] I [repce(agent
> /rhgs/brick20/brick):96:service_loop] RepceServer: terminating on reaching
> EOF.
>
> [2020-06-25 07:07:20.275701] I [monitor(monitor):280:monitor] Monitor:
> worker died in startup phase     brick=/rhgs/brick20/brick
>
> [2020-06-25 07:07:20.277383] I
> [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status
> Change status=Faulty
>
>
>
> We’ve done everything we can think of, including an “strace -f” on the pid,
> and we can’t really find anything. I’m about to lose the last of my hair over
> this, so does anyone have any ideas at all? We’ve even removed the entire
> slave volume and rebuilt it.
>
>
>
> Thanks
>
> Rob
>
>
>
> *Rob Quagliozzi*
>
> *Specialised Application Support*
>
>
>
>
>
>
>
> ________
>
>
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://bluejeans.com/441850968
>
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
>

