[Gluster-users] [geo-rep] Replication faulty - gsyncd.log OSError: [Errno 13] Permission denied

Sat Sep 22 04:52:37 UTC 2018

The problem occured on slave side whose error is propagated to master.
Mostly any traceback with repce involved is related to problem in slave.
Just check few lines above in the log to find the slave node, the crashed
worker is connected to and get geo replication logs to further debug.

On Fri, 21 Sep 2018, 20:10 Kotte, Christian (Ext), <
christian.kotte at novartis.com> wrote:

> Hi,
>
>
>
> Any idea how to troubleshoot this?
>
>
>
> New folders and files were created on the master and the replication went
> faulty. They were created via Samba.
>
>
>
> Version: GlusterFS 4.1.3
>
>
>
> [root at master]# gluster volume geo-replication status
>
>
>
> MASTER NODE                         MASTER VOL     MASTER BRICK
> SLAVE USER
> SLAVE                                                             SLAVE
> NODE    STATUS    CRAWL STATUS    LAST_SYNCED
>
>
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> master    glustervol1    /bricks/brick1/brick    geoaccount
> ssh://geoaccount@slave_1::glustervol1       N/A           Faulty
> N/A             N/A
>
> master    glustervol1    /bricks/brick1/brick    geoaccount
> ssh://geoaccount@slave_2::glustervol1       N/A           Faulty
> N/A             N/A
>
> master    glustervol1    /bricks/brick1/brick    geoaccount
> ssh://geoaccount@interimmaster::glustervol1   N/A           Faulty
> N/A             N/A
>
>
>
> The following error is repeatedly logged in the gsyncd.logs:
>
> [2018-09-21 14:26:38.611479] I [repce(agent
> /bricks/brick1/brick):80:service_loop] RepceServer: terminating on reaching
> EOF.
>
> [2018-09-21 14:26:39.211527] I [monitor(monitor):279:monitor] Monitor:
> worker died in startup phase     brick=/bricks/brick1/brick
>
> [2018-09-21 14:26:39.214322] I
> [gsyncdstatus(monitor):244:set_worker_status] GeorepStatus: Worker Status
> Change status=Faulty
>
> [2018-09-21 14:26:49.318953] I [monitor(monitor):158:monitor] Monitor:
> starting gsyncd worker   brick=/bricks/brick1/brick      slave_node=
> nrchbs-slp2020.nibr.novartis.net
>
> [2018-09-21 14:26:49.471532] I [gsyncd(agent
> /bricks/brick1/brick):297:main] <top>: Using session config file
> path=/var/lib/glusterd/geo-replication/glustervol1_nrchbs-slp2020.nibr.novartis.net_glustervol1/gsyncd.conf
>
> [2018-09-21 14:26:49.473917] I [changelogagent(agent
> /bricks/brick1/brick):72:__init__] ChangelogAgent: Agent listining...
>
> [2018-09-21 14:26:49.491359] I [gsyncd(worker
> /bricks/brick1/brick):297:main] <top>: Using session config file
> path=/var/lib/glusterd/geo-replication/glustervol1_nrchbs-slp2020.nibr.novartis.net_glustervol1/gsyncd.conf
>
> [2018-09-21 14:26:49.538049] I [resource(worker
> /bricks/brick1/brick):1377:connect_remote] SSH: Initializing SSH connection
> between master and slave...
>
> [2018-09-21 14:26:53.5017] I [resource(worker
> /bricks/brick1/brick):1424:connect_remote] SSH: SSH connection between
> master and slave established.      duration=3.4665
>
> [2018-09-21 14:26:53.5419] I [resource(worker
> /bricks/brick1/brick):1096:connect] GLUSTER: Mounting gluster volume
> locally...
>
> [2018-09-21 14:26:54.120374] I [resource(worker
> /bricks/brick1/brick):1119:connect] GLUSTER: Mounted gluster volume
> duration=1.1146
>
> [2018-09-21 14:26:54.121012] I [subcmds(worker
> /bricks/brick1/brick):70:subcmd_worker] <top>: Worker spawn successful.
> Acknowledging back to monitor
>
> [2018-09-21 14:26:56.144460] I [master(worker
> /bricks/brick1/brick):1593:register] _GMaster: Working dir
> path=/var/lib/misc/gluster/gsyncd/glustervol1_nrchbs-slp2020.nibr.novartis.net_glustervol1/bricks-brick1-brick
>
> [2018-09-21 14:26:56.145145] I [resource(worker
> /bricks/brick1/brick):1282:service_loop] GLUSTER: Register time
> time=1537540016
>
> [2018-09-21 14:26:56.160064] I [gsyncdstatus(worker
> /bricks/brick1/brick):277:set_active] GeorepStatus: Worker Status Change
> status=Active
>
> [2018-09-21 14:26:56.161175] I [gsyncdstatus(worker
> /bricks/brick1/brick):249:set_worker_crawl_status] GeorepStatus: Crawl
> Status Change        status=History Crawl
>
> [2018-09-21 14:26:56.161536] I [master(worker
> /bricks/brick1/brick):1507:crawl] _GMaster: starting history crawl
> turns=1 stime=(1537522637, 0)   entry_stime=(1537537141, 0)
> etime=1537540016
>
> [2018-09-21 14:26:56.164277] I [master(worker
> /bricks/brick1/brick):1536:crawl] _GMaster: slave's time
> stime=(1537522637, 0)
>
> [2018-09-21 14:26:56.197065] I [master(worker
> /bricks/brick1/brick):1360:process] _GMaster: Skipping already processed
> entry ops        to_changelog=1537522638 num_changelogs=1
> from_changelog=1537522638
>
> [2018-09-21 14:26:56.197402] I [master(worker
> /bricks/brick1/brick):1374:process] _GMaster: Entry Time Taken    MKD=0
> MKN=0   LIN=0   SYM=0   REN=0   RMD=0   CRE=0   duration=0.0000 UNL=1
>
> [2018-09-21 14:26:56.197623] I [master(worker
> /bricks/brick1/brick):1384:process] _GMaster: Data/Metadata Time Taken
> SETA=0  SETX=0  meta_duration=0.0000    data_duration=0.0284    DATA=0
> XATT=0
>
> [2018-09-21 14:26:56.198230] I [master(worker
> /bricks/brick1/brick):1394:process] _GMaster: Batch Completed
> changelog_end=1537522638        entry_stime=(1537537141, 0)
> changelog_start=1537522638      stime=(1537522637, 0)   duration=0.0333
> num_changelogs=1        mode=history_changelog
>
> [2018-09-21 14:26:57.200436] I [master(worker
> /bricks/brick1/brick):1536:crawl] _GMaster: slave's time
> stime=(1537522637, 0)
>
> [2018-09-21 14:26:57.528625] E [repce(worker
> /bricks/brick1/brick):197:__call__] RepceClient: call failed
> call=17209:140650361157440:1537540017.21        method=entry_ops
> error=OSError
>
> [2018-09-21 14:26:57.529371] E [syncdutils(worker
> /bricks/brick1/brick):332:log_raise_exception] <top>: FAIL:
>
> Traceback (most recent call last):
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 311, in
> main
>
>     func(args)
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/subcmds.py", line 72, in
> subcmd_worker
>
>     local.service_loop(remote)
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1288,
> in service_loop
>
>     g3.crawlwrap(oneshot=True)
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 615, in
> crawlwrap
>
>     self.crawl()
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1545, in
> crawl
>
>     self.changelogs_batch_process(changes)
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1445, in
> changelogs_batch_process
>
>     self.process(batch)
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1280, in
> process
>
>     self.process_change(change, done, retry)
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1179, in
> process_change
>
>     failures = self.slave.server.entry_ops(entries)
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 216, in
> __call__
>
>     return self.ins(self.meth, *a)
>
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 198, in
> __call__
>
>     raise res
>
> OSError: [Errno 13] Permission denied:
> '/bricks/brick1/brick/.glusterfs/29/d1/29d1d60d-1ad6-45fc-87e0-93d478f7331e'
>
>
>
> The permissions look fine. The replication is done via geo user instead of
> root. It should be able to read, but I’m not sure if the syncdaemon runs
> under geoaccount!?
>
>
>
> [root at master vRealize Operation Manager]# ll
> /bricks/brick1/brick/.glusterfs/29/d1/29d1d60d-1ad6-45fc-87e0-93d478f7331e
>
> lrwxrwxrwx. 1 root root 75 Sep 21 09:39
> /bricks/brick1/brick/.glusterfs/29/d1/29d1d60d-1ad6-45fc-87e0-93d478f7331e
> -> ../../6b/97/6b97b987-8aef-46c3-af27-20d3aa883016/vRealize Operation
> Manager
>
>
>
> [root at master vRealize Operation Manager]# ll
> /bricks/brick1/brick/.glusterfs/29/d1/29d1d60d-1ad6-45fc-87e0-93d478f7331e/
>
> total 4
>
> drwxrwxr-x. 2 AD+user AD+group  131 Sep 21 10:14 6.7
>
> drwxrwxr-x. 2 AD+user AD+group 4096 Sep 21 09:43 7.0
>
> drwxrwxr-x. 2 AD+user AD+group   57 Sep 21 10:28 7.5
>
> [root at master vRealize Operation Manager]#
>
>
>
> It could be possible that the folder was renamed. I had 3 similar issues
> since I migrated to GlusterFS 4.x but couldn’t investigate much. I needed
> to completely wipe GlusterFS and geo-repliction to get rid of this error…
>
>
>
> Any help is appreciated.
>
>
>
> Regards,
>
>
>
> *Christian Kotte*
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20180922/f7e83e63/attachment.html>