[Gluster-users] Geo Replication OSError: [Errno 107] Transport endpoint is not connected

Michael Roth michael.roth at tuwien.ac.at
Thu Oct 25 12:10:40 UTC 2018


I have a big problem.

When I start geo-replication everything seems fine, but after replicating about
2.5 TB I get errors, and it starts over and over again with the same errors.

I have two nodes with a replicated volume and a third arbiter node.
The destination is a single node.
The firewall between all nodes is open.

Master Log

[2018-10-25 07:08:59.619699] D 
[master(/gluster/owncloud/brick2):1665:Xcrawl] _GMaster: entering 
./data/fa/files/backup/research/projects/2011-Regularity/2012-03-Gain-of-Regularity-linearWFP
[2018-10-25 07:08:59.619874] E 
[syncdutils(/gluster/owncloud/brick2):325:log_raise_exception] <top>: 
glusterfs session went down        error=ENOTCONN
[2018-10-25 07:08:59.620109] E 
[syncdutils(/gluster/owncloud/brick2):331:log_raise_exception] <top>: 
FULL EXCEPTION TRACE:
Traceback (most recent call last):
   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 210, 
in main
     main_i()
   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 801, 
in main_i
     local.service_loop(*[r for r in [remote] if r])
   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 
1679, in service_loop
     g1.crawlwrap(oneshot=True, register_time=register_time)
   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 597, 
in crawlwrap
     self.crawl()
   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1555, 
in crawl
     self.process([item[1]], 0)
   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1204, 
in process
     self.process_change(change, done, retry)
   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1143, 
in process_change
     st = lstat(go[0])
   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 
553, in lstat
     return errno_wrap(os.lstat, [e], [ENOENT], [ESTALE, EBUSY])
   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 
535, in errno_wrap
     return call(*arg)
OSError: [Errno 107] Transport endpoint is not connected: 
'.gfid/5c143d64-165f-44b1-98ed-71e491376a76'
[2018-10-25 07:08:59.627846] D 
[master(/gluster/owncloud/brick2):1665:Xcrawl] _GMaster: entering 
./data/fa/files/backup/research/projects/2011-Regularity/resources
[2018-10-25 07:08:59.632826] D 
[master(/gluster/owncloud/brick2):1665:Xcrawl] _GMaster: entering 
./data/fa/files/backup/research/projects/2011-Regularity/add material
[2018-10-25 07:08:59.633582] D 
[master(/gluster/owncloud/brick2):1665:Xcrawl] _GMaster: entering 
./data/fa/files/backup/research/projects/2011-Regularity/add material/Maple
[2018-10-25 07:08:59.636306] D 
[master(/gluster/owncloud/brick2):1665:Xcrawl] _GMaster: entering 
./data/fa/files/backup/research/projects/2011-Regularity/add material/notes
[2018-10-25 07:08:59.637303] I 
[syncdutils(/gluster/owncloud/brick2):271:finalize] <top>: exiting.
[2018-10-25 07:08:59.640778] I 
[repce(/gluster/owncloud/brick2):92:service_loop] RepceServer: 
terminating on reaching EOF.
[2018-10-25 07:08:59.641222] I 
[syncdutils(/gluster/owncloud/brick2):271:finalize] <top>: exiting.
[2018-10-25 07:09:00.314140] I [monitor(monitor):363:monitor] Monitor: 
worker died in startup phase brick=/gluster/owncloud/brick2
[2018-10-25 07:09:00.315172] I 
[gsyncdstatus(monitor):243:set_worker_status] GeorepStatus: Worker 
Status Change status=Faulty
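
For what it's worth, the lstat in the traceback goes through
errno_wrap(os.lstat, [e], [ENOENT], [ESTALE, EBUSY]). Judging from those
arguments, ENOENT is ignored and ESTALE/EBUSY are retried, but ENOTCONN
(errno 107) is neither, so the exception propagates, the worker exits, and
the monitor restarts it, which matches the "worker died in startup phase"
and Faulty lines above. A minimal sketch of that behaviour (the function and
argument names here are my own, not the actual gsyncd source):

import os
import time
from errno import ENOENT, ESTALE, EBUSY

def errno_wrap_sketch(call, args, ignore_errnos, retry_errnos, retries=5):
    # Sketch only: call `call(*args)`, swallow errnos in ignore_errnos,
    # retry the ones in retry_errnos a few times, re-raise everything else.
    for _ in range(retries):
        try:
            return call(*args)
        except OSError as err:
            if err.errno in ignore_errnos:
                return None        # e.g. ENOENT: file vanished, not fatal
            if err.errno in retry_errnos:
                time.sleep(1)      # e.g. ESTALE/EBUSY: transient, try again
                continue
            raise                  # ENOTCONN (107) ends up here -> worker dies

# e.g. errno_wrap_sketch(os.lstat, ['.gfid/<gfid>'], [ENOENT], [ESTALE, EBUSY])

So the worker itself looks like a victim; the underlying problem seems to be
the glusterfs aux mount for /gluster/owncloud/brick2 going down mid-crawl
("glusterfs session went down  error=ENOTCONN").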

Slave Log

[2018-10-25 07:08:44.206372] I [resource(slave):1502:connect] GLUSTER: 
Mounting gluster volume locally...
[2018-10-25 07:08:45.229620] I [resource(slave):1515:connect] GLUSTER: 
Mounted gluster volume   duration=1.0229
[2018-10-25 07:08:45.230180] I [resource(slave):1012:service_loop] 
GLUSTER: slave listening
[2018-10-25 07:08:59.641242] I [repce(slave):92:service_loop] 
RepceServer: terminating on reaching EOF.
[2018-10-25 07:08:59.655611] I [syncdutils(slave):271:finalize] <top>: 
exiting.

Volume Info

Volume Name: datacloud
Type: Replicate
Volume ID: 6cc79599-7a5c-4b02-bd86-13020a9d91db
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 172.17.45.11:/gluster/datacloud/brick2
Brick2: 172.17.45.12:/gluster/datacloud/brick2
Brick3: 172.17.45.13:/gluster/datacloud/brick2 (arbiter)
Options Reconfigured:
cluster.server-quorum-type: server
cluster.shd-max-threads: 32
cluster.self-heal-readdir-size: 64KB
cluster.quorum-type: fixed
transport.address-family: inet
diagnostics.brick-log-level: INFO
changelog.capture-del-path: on
storage.build-pgfid: on
changelog.changelog: on
geo-replication.ignore-pid-check: on
server.statedump-path: /tmp/gluster
cluster.self-heal-window-size: 32
geo-replication.indexing: on
nfs.trusted-sync: off
diagnostics.dump-fd-stats: off
nfs.disable: on
cluster.self-heal-daemon: enable
cluster.background-self-heal-count: 16
cluster.heal-timeout: 120
cluster.data-self-heal-algorithm: full
cluster.consistent-metadata: on
network.ping-timeout: 20
cluster.granular-entry-heal: enable
cluster.server-quorum-ratio: 51%
cluster.enable-shared-storage: enable

Best regards,
Michael


-- 
Michael Roth  | michael.roth at tuwien.ac.at
IT Solutions - Application Management
Technische Universität Wien - Operngasse 11, 1040 Wien
T +43-1-58801-42091
