[Gluster-users] Geo-replication gets faulty after first sync of files. GlusterFS 3.7.4

Wade Fitzpatrick wade.fitzpatrick at ladbrokes.com.au
Wed Oct 21 03:56:30 UTC 2015


That is almost exactly what I am seeing too. It seems to be a problem 
with geo-replicating a Stripe and/or Replicate volume.
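
For anyone comparing setups, this is roughly how I pull the two things that matter
side by side - the volume layout and the geo-rep session state. Just a sketch: the
volume and slave names are the ones from this thread, so substitute your own.

    # Minimal sketch: dump volume layout and geo-rep status for comparison.
    # MASTER_VOL / SLAVE_URL are taken from this thread and are only examples.
    import subprocess

    MASTER_VOL = "storage-2"
    SLAVE_URL = "root@east-storage2::storage-2"

    def gluster(cmd):
        # Run a gluster CLI command and return its output as text.
        return subprocess.check_output("gluster " + cmd, shell=True).decode()

    print(gluster("volume info %s" % MASTER_VOL))
    print(gluster("volume geo-replication %s %s status detail"
                  % (MASTER_VOL, SLAVE_URL)))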

On 21/10/2015 4:02 am, Родион Скрябин wrote:
> I have been fighting a long war with geo-replication. The first two battles were won:
> 1. Geo-replication on a cluster with glusterfs-3.4.1 (CentOS 6.6, xfs: 
> server1-server2 => geo-replica at east-server1).
> 2. A volume named infra-geo geo-replicated well on a 2nd cluster 
> with glusterfs-3.7.4 (CentOS 7.1, ext4: server3-server4 => geo-replica 
> at east-server2).
>
> The latest battle has been going on for two months, and I need your help to win it :)
>
> Geo-replication of the second volume, named storage-2, goes faulty after 
> the initial sync of data. It manages to sync all the files (620 GB last time) 
> and then becomes faulty. Restarting each and every gluster service does not 
> change the result - the status stays the same: faulty.
> This second volume, storage-2, is also geo-replicated to the 2nd cluster 
> with glusterfs-3.7.4 (CentOS 7.1, ext4: server3-server4 => geo-replica 
> at east-server2).
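
Since restarting all the gluster services did not help you, it may be worth cycling
only the geo-rep session and watching how quickly it flips back to Faulty. A rough
sketch of what I do (same caveat: the volume/slave names are the ones from this thread):

    # Sketch: stop/start only the geo-replication session, then poll its
    # status a few times to see when it goes Faulty again.
    import subprocess
    import time

    MASTER_VOL = "storage-2"
    SLAVE_URL = "root@east-storage2::storage-2"
    SESSION = "%s %s" % (MASTER_VOL, SLAVE_URL)

    def gluster(cmd):
        # Run a gluster CLI command and return its output as text.
        return subprocess.check_output("gluster " + cmd, shell=True).decode()

    gluster("volume geo-replication %s stop force" % SESSION)
    gluster("volume geo-replication %s start force" % SESSION)

    for _ in range(10):                 # watch the session for ~5 minutes
        print(gluster("volume geo-replication %s status" % SESSION))
        time.sleep(30)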
>
> The noticeable things:
> (date the geo-replica was created: '2015-10-16 16:03')
> 1. 
> /var/log/glusterfs/geo-replication/storage-2/ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1%3Astorage-2.%2Fmnt%2Fsda4%2Fbrick-storage-3-changes.log-20151018 
> is ordinary and is 8.8 MB in size.
>
> The current changes log, started at '2015-10-18 17:43:17.117287', is 
> 9.7 GB - 34909550 lines of:
> E [mem-pool.c:417:mem_get0] 
> (-->/lib64/libglusterfs.so.0(recursive_rmdir+0x192) [0x7f885735cbb2] 
> -->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f885733f55f] 
> -->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f8857371751] ) 
> 0-mem-pool: invalid argument [Invalid argument]
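
(Side note: with a 9.7 GB log, this is roughly how I count those repeated mem-pool
lines without loading the whole file into memory - just a sketch, pass the full
...-changes.log path shown above as the argument:)

    # Sketch: stream a huge geo-rep changes.log and count how many lines
    # contain the repeated mem-pool error quoted above.
    import sys

    NEEDLE = "0-mem-pool: invalid argument"

    count = 0
    with open(sys.argv[1]) as log:      # e.g. the 9.7 GB ...-changes.log
        for line in log:                # read line by line, constant memory
            if NEEDLE in line:
                count += 1
    print("%d matching lines" % count)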
>
> Only 22 MB of it has something more unique:
> [2015-10-19 16:42:42.652110] I 
> [gf-changelog.c:542:gf_changelog_register_generic] 0-gfchangelog: 
> Registering brick: /mnt/sda4/brick-storage-3 [notify filter: 1]
> [2015-10-19 16:42:44.106591] T [rpcsvc.c:2298:rpcsvc_init] 
> 0-rpc-service: rx pool: 8
> [2015-10-19 16:42:44.106643] T 
> [rpcsvc-auth.c:119:rpcsvc_auth_init_auth] 0-rpc-service: 
> Authentication enabled: AUTH_GLUSTERFS
> [2015-10-19 16:42:44.106648] T 
> [rpcsvc-auth.c:119:rpcsvc_auth_init_auth] 0-rpc-service: 
> Authentication enabled: AUTH_GLUSTERFS-v2
> [2015-10-19 16:42:44.106656] T 
> [rpcsvc-auth.c:119:rpcsvc_auth_init_auth] 0-rpc-service: 
> Authentication enabled: AUTH_UNIX
> [2015-10-19 16:42:44.106661] T 
> [rpcsvc-auth.c:119:rpcsvc_auth_init_auth] 0-rpc-service: 
> Authentication enabled: AUTH_NULL
> [2015-10-19 16:42:44.106664] D [rpcsvc.c:2317:rpcsvc_init] 
> 0-rpc-service: RPC service inited.
> [2015-10-19 16:42:44.106673] D [rpcsvc.c:1874:rpcsvc_program_register] 
> 0-rpc-service: New program registered: GF-DUMP, Num: 123451501, Ver: 
> 1, Port: 0
> [2015-10-19 16:42:44.106687] D 
> [rpc-transport.c:288:rpc_transport_load] 0-rpc-transport: attempt to 
> load file /usr/lib64/glusterfs/3.7.4/rpc-transport/socket.so
> [2015-10-19 16:42:44.107068] D [socket.c:3794:socket_init] 
> 0-socket.gfchangelog: disabling nodelay
> [2015-10-19 16:42:44.107076] D [socket.c:3845:socket_init] 
> 0-socket.gfchangelog: Configued transport.tcp-user-timeout=0
> [2015-10-19 16:42:44.107081] D [socket.c:3928:socket_init] 
> 0-socket.gfchangelog: SSL support on the I/O path is NOT enabled
> [2015-10-19 16:42:44.107084] D [socket.c:3931:socket_init] 
> 0-socket.gfchangelog: SSL support for glusterd is NOT enabled
> [2015-10-19 16:42:44.107093] D [socket.c:3948:socket_init] 
> 0-socket.gfchangelog: using system polling thread
> [2015-10-19 16:42:44.107159] D [rpcsvc.c:1874:rpcsvc_program_register] 
> 0-rpc-service: New program registered: LIBGFCHANGELOG REBORP, Num: 
> 1886350951, Ver: 1, Port: 0
> [2015-10-19 16:42:44.107182] D 
> [rpc-clnt.c:989:rpc_clnt_connection_init] 0-gfchangelog: defaulting 
> frame-timeout to 30mins
> [2015-10-19 16:42:44.107192] D 
> [rpc-clnt.c:1003:rpc_clnt_connection_init] 0-gfchangelog: disable 
> ping-timeout
> [2015-10-19 16:42:44.107200] D 
> [rpc-transport.c:288:rpc_transport_load] 0-rpc-transport: attempt to 
> load file /usr/lib64/glusterfs/3.7.4/rpc-transport/socket.so
> [2015-10-19 16:42:44.107233] D [socket.c:3794:socket_init] 
> 0-gfchangelog: disabling nodelay
> [2015-10-19 16:42:44.107238] D [socket.c:3845:socket_init] 
> 0-gfchangelog: Configued transport.tcp-user-timeout=0
> [2015-10-19 16:42:44.107242] D [socket.c:3928:socket_init] 
> 0-gfchangelog: SSL support on the I/O path is NOT enabled
> [2015-10-19 16:42:44.107248] D [socket.c:3931:socket_init] 
> 0-gfchangelog: SSL support for glusterd is NOT enabled
> [2015-10-19 16:42:44.107255] D [socket.c:3948:socket_init] 
> 0-gfchangelog: using system polling thread
> [2015-10-19 16:42:44.107263] T [rpc-clnt.c:418:rpc_clnt_reconnect] 
> 0-gfchangelog: attempting reconnect
> [2015-10-19 16:42:44.107267] T [socket.c:2887:socket_connect] 
> 0-gfchangelog: connecting 0x7f2dbc08c7e0, state=0 gen=0 sock=-1
> [2015-10-19 16:42:44.107272] T 
> [name.c:295:af_unix_client_get_remote_sockaddr] 0-gfchangelog: using 
> connect-path 
> /var/run/gluster/changelog-535626c1c4cea4957a54c920f955b1ac.sock
> [2015-10-19 16:42:44.107284] T [name.c:111:af_unix_client_bind] 
> 0-gfchangelog: bind-path not specified for unix socket, letting 
> connect to assign default value
> [2015-10-19 16:42:46.107494] T [rpc-clnt.c:1406:rpc_clnt_record] 
> 0-gfchangelog: Auth Info: pid: 0, uid: 0, gid: 0, owner:
> [2015-10-19 16:42:46.107543] T 
> [rpc-clnt.c:1263:rpc_clnt_record_build_header] 0-rpc-clnt: Request 
> fraglen 500, payload: 436, rpc hdr: 64
> [2015-10-19 16:42:46.107589] T [rpc-clnt.c:1600:rpc_clnt_submit] 
> 0-rpc-clnt: submitted request (XID: 0x1 Program: LIBGFCHANGELOG, 
> ProgVers: 1, Proc: 1) to rpc-transport (gfchangelog)
> [2015-10-19 16:42:46.107618] D 
> [rpc-clnt-ping.c:281:rpc_clnt_start_ping] 0-gfchangelog: ping timeout 
> is 0, returning
> [2015-10-19 16:42:46.109489] T [rpc-clnt.c:663:rpc_clnt_reply_init] 
> 0-gfchangelog: received rpc message (RPC XID: 0x1 Program: 
> LIBGFCHANGELOG, ProgVers: 1, Proc: 1) from rpc-transport (gfchangelog)
> [2015-10-19 16:42:46.146891] I 
> [gf-history-changelog.c:760:gf_changelog_extract_min_max] 
> 0-gfchangelog: MIN: 1439390689, MAX: 1445272962, TOTAL CHANGELOGS: 391739
> [2015-10-19 16:42:46.147039] I 
> [gf-history-changelog.c:907:gf_history_changelog] 0-gfchangelog: 
> FINAL: from: 1445023075, to: 1445272962, changes: 16643
> [2015-10-19 16:42:46.148604] D 
> [gf-history-changelog.c:298:gf_history_changelog_scan] 0-gfchangelog: 
> hist_done 1, is_last_scan: 0
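
(Those changelog numbers are plain Unix timestamps, which makes the history range
easier to read. A quick decode of the values quoted above; the roughly 15-second
spacing between CHANGELOG.<epoch> files is, as far as I know, the stock changelog
rollover interval:)

    # Sketch: decode the epoch values from the gf_history_changelog lines
    # above into human-readable UTC times.
    from datetime import datetime

    def utc(ts):
        return datetime.utcfromtimestamp(ts).isoformat() + "Z"

    for label, ts in [("MIN", 1439390689),
                      ("MAX", 1445272962),
                      ("FINAL from", 1445023075),
                      ("FINAL to", 1445272962)]:
        print("%-10s %d -> %s" % (label, ts, utc(ts)))

    # Each CHANGELOG.<epoch> file then covers one ~15 s interval, e.g.:
    print(utc(1445022009))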
>
>
>
> 2. 
>  /var/log/glusterfs/geo-replication/storage-2/ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1%3Astorage-2.log 
>
> [2015-10-18 10:42:44.890366] W [master(/mnt/sda4/brick-storage-3):1041:process] _GMaster: incomplete sync, retrying changelogs:
> CHANGELOG.1445022009 CHANGELOG.1445022024 CHANGELOG.1445022039 CHANGELOG.1445022054
> CHANGELOG.1445022069 CHANGELOG.1445022084 CHANGELOG.1445022099 CHANGELOG.1445022114
> CHANGELOG.1445022129 CHANGELOG.1445022144 CHANGELOG.1445022159 CHANGELOG.1445022174
> CHANGELOG.1445022189 CHANGELOG.1445022204 CHANGELOG.1445022219 CHANGELOG.1445022234
> CHANGELOG.1445022249 CHANGELOG.1445022264 CHANGELOG.1445022279 CHANGELOG.1445022294
> CHANGELOG.1445022309 CHANGELOG.1445022324 CHANGELOG.1445022339 CHANGELOG.1445022354
> CHANGELOG.1445022369 CHANGELOG.1445022384 CHANGELOG.1445022399 CHANGELOG.1445022414
> CHANGELOG.1445022429 CHANGELOG.1445022444 CHANGELOG.1445022459 CHANGELOG.1445022474
> CHANGELOG.1445022489 CHANGELOG.1445022504 CHANGELOG.1445022519 CHANGELOG.1445022534
> CHANGELOG.1445022549 CHANGELOG.1445022564 CHANGELOG.1445022579 CHANGELOG.1445022594
> CHANGELOG.1445022609 CHANGELOG.1445022624 CHANGELOG.1445022639 CHANGELOG.1445022654
> CHANGELOG.1445022669 CHANGELOG.1445022684 CHANGELOG.1445022699 CHANGELOG.1445022714
> CHANGELOG.1445022729 CHANGELOG.1445022744 CHANGELOG.1445022759 CHANGELOG.1445022774
> CHANGELOG.1445022789 CHANGELOG.1445022805 CHANGELOG.1445022820 CHANGELOG.1445022835
> CHANGELOG.1445022850 CHANGELOG.1445022865 CHANGELOG.1445022880 CHANGELOG.1445022895
> CHANGELOG.1445022910 CHANGELOG.1445022925 CHANGELOG.1445022940 CHANGELOG.1445022955
> CHANGELOG.1445022970 CHANGELOG.1445022985 CHANGELOG.1445023000 CHANGELOG.1445023015
> CHANGELOG.1445023030 CHANGELOG.1445023045 CHANGELOG.1445023060 CHANGELOG.1445023075
> CHANGELOG.1445023090 CHANGELOG.1445023105 CHANGELOG.1445023120 CHANGELOG.1445023135
> CHANGELOG.1445023150 CHANGELOG.1445023165 CHANGELOG.1445023180 CHANGELOG.1445023195
> CHANGELOG.1445023210 CHANGELOG.1445023225 CHANGELOG.1445023240 CHANGELOG.1445023255
> CHANGELOG.1445023270 CHANGELOG.1445023285 CHANGELOG.1445023300 CHANGELOG.1445023315
> CHANGELOG.1445023330 CHANGELOG.1445023345 CHANGELOG.1445023360 CHANGELOG.1445023375
> CHANGELOG.1445023390 CHANGELOG.1445023405 CHANGELOG.1445023420 CHANGELOG.1445023435
> CHANGELOG.1445023450 CHANGELOG.1445023465 CHANGELOG.1445023480 CHANGELOG.1445023495
> CHANGELOG.1445023510 CHANGELOG.1445023525 CHANGELOG.1445023540 CHANGELOG.1445023555
> CHANGELOG.1445023570 CHANGELOG.1445023585 CHANGELOG.1445023600 CHANGELOG.1445023615
> CHANGELOG.1445023630 CHANGELOG.1445023645 CHANGELOG.1445023660 CHANGELOG.1445023675
> CHANGELOG.1445023690 CHANGELOG.1445023705 CHANGELOG.1445023720 CHANGELOG.1445023735
> CHANGELOG.1445023750 CHANGELOG.1445023765 CHANGELOG.1445023780 CHANGELOG.1445023796
> CHANGELOG.1445023811 CHANGELOG.1445023826 CHANGELOG.1445023841 CHANGELOG.1445023856
> CHANGELOG.1445023871 CHANGELOG.1445023886 CHANGELOG.1445023901 CHANGELOG.1445023916
> CHANGELOG.1445023931 CHANGELOG.1445023946 CHANGELOG.1445023961 CHANGELOG.1445023976
> CHANGELOG.1445023991 CHANGELOG.1445024006 CHANGELOG.1445024021 CHANGELOG.1445024036
> CHANGELOG.1445024051 CHANGELOG.1445024066 CHANGELOG.1445024081 CHANGELOG.1445024096
> CHANGELOG.1445024111 CHANGELOG.1445024126 CHANGELOG.1445024141 CHANGELOG.1445024156
> CHANGELOG.1445024171 CHANGELOG.1445024186 CHANGELOG.1445024201 CHANGELOG.1445024216
> CHANGELOG.1445024231 CHANGELOG.1445024246 CHANGELOG.1445024261 CHANGELOG.1445024276
> CHANGELOG.1445024291 CHANGELOG.1445024306 CHANGELOG.1445024321 CHANGELOG.1445024336
> CHANGELOG.1445024351 CHANGELOG.1445024366 CHANGELOG.1445024381 CHANGELOG.1445024396
> CHANGELOG.1445024411
> [2015-10-18 10:43:02.808736] E 
> [repce(/mnt/sda4/brick-storage-3):207:__call__] RepceClient: call 
> 24010:140490734180160:1445190182.72 (entry_ops) failed on peer with 
> OSError
> [2015-10-18 10:43:02.809122] E 
> [syncdutils(/mnt/sda4/brick-storage-3):276:log_raise_exception] <top>: 
> FAIL:
> Traceback (most recent call last):
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, 
> in main
>     main_i()
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, 
> in main_i
>     local.service_loop(*[r for r in [remote] if r])
>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 
> 1451, in service_loop
>     g2.crawlwrap()
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 591, 
> in crawlwrap
>     self.crawl()
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 
> 1106, in crawl
>     self.changelogs_batch_process(changes)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 
> 1081, in changelogs_batch_process
>     self.process(batch)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 959, 
> in process
>     self.process_change(change, done, retry)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 914, 
> in process_change
>     failures = self.slave.server.entry_ops(entries)
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, 
> in __call__
>     return self.ins(self.meth, *a)
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, 
> in __call__
>     raise res
> OSError: [Errno 16] Device or resource busy
> [2015-10-18 10:43:02.811510] I 
> [syncdutils(/mnt/sda4/brick-storage-3):220:finalize] <top>: exiting.
> [2015-10-18 10:43:02.819149] I [repce(agent):92:service_loop] 
> RepceServer: terminating on reaching EOF.
> [2015-10-18 10:43:02.819616] I [syncdutils(agent):220:finalize] <top>: 
> exiting.
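
(The OSError above is errno 16, EBUSY, raised on the slave inside entry_ops() and
passed back through repce. For reference, this is the kind of operation that
produces that errno - e.g. removing a directory that is a live mount point; the
path below is purely an example, not taken from this setup:)

    # Sketch: what an EBUSY OSError looks like from Python - the same errno
    # the slave's entry_ops() is sending back to the master worker.
    import errno
    import os

    try:
        os.rmdir("/mnt/example-live-mountpoint")   # hypothetical busy directory
    except OSError as e:
        if e.errno == errno.EBUSY:
            print("EBUSY (errno 16): %s" % e)
        else:
            raise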
>
> Since then I keep getting this repeating error:
> [2015-10-18 10:43:13.73946] I [monitor(monitor):221:monitor] Monitor: 
> ------------------------------------------------------------
> [2015-10-18 10:43:13.74423] I [monitor(monitor):222:monitor] Monitor: 
> starting gsyncd worker
> [2015-10-18 10:43:13.177963] I [changelogagent(agent):75:__init__] 
> ChangelogAgent: Agent listining...
> [2015-10-18 10:43:13.178881] I 
> [gsyncd(/mnt/sda4/brick-storage-3):649:main_i] <top>: syncing: 
> gluster://localhost:storage-2 -> 
> ssh://root@east-storage2:gluster://localhost:storage-2
> [2015-10-18 10:43:17.93168] I 
> [master(/mnt/sda4/brick-storage-3):83:gmaster_builder] <top>: setting 
> up xsync change detection mode
> [2015-10-18 10:43:17.93669] I 
> [master(/mnt/sda4/brick-storage-3):404:__init__] _GMaster: using 
> 'rsync' as the sync engine
> [2015-10-18 10:43:17.95079] I 
> [master(/mnt/sda4/brick-storage-3):83:gmaster_builder] <top>: setting 
> up changelog change detection mode
> [2015-10-18 10:43:17.95390] I 
> [master(/mnt/sda4/brick-storage-3):404:__init__] _GMaster: using 
> 'rsync' as the sync engine
> [2015-10-18 10:43:17.96730] I 
> [master(/mnt/sda4/brick-storage-3):83:gmaster_builder] <top>: setting 
> up changeloghistory change detection mode
> [2015-10-18 10:43:17.97048] I 
> [master(/mnt/sda4/brick-storage-3):404:__init__] _GMaster: using 
> 'rsync' as the sync engine
> [2015-10-18 10:43:19.915657] I 
> [master(/mnt/sda4/brick-storage-3):1220:register] _GMaster: xsync temp 
> directory: 
> /var/lib/misc/glusterfsd/storage-2/ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1%3Astorage-2/535626c1c4cea4957a
> 54c920f955b1ac/xsync
> [2015-10-18 10:43:19.916011] I 
> [resource(/mnt/sda4/brick-storage-3):1432:service_loop] GLUSTER: 
> Register time: 1445190199
> [2015-10-18 10:43:19.935817] I 
> [master(/mnt/sda4/brick-storage-3):530:crawlwrap] _GMaster: primary 
> master with volume id 054d6225-39fc-40f5-9604-5ee4b6dcd8b4 ...
> [2015-10-18 10:43:19.938358] I 
> [master(/mnt/sda4/brick-storage-3):539:crawlwrap] _GMaster: crawl 
> interval: 1 seconds
> [2015-10-18 10:43:19.942023] I 
> [master(/mnt/sda4/brick-storage-3):1135:crawl] _GMaster: starting 
> history crawl... turns: 1, stime: (1445021993, 0), etime: 1445190199
> [2015-10-18 10:43:20.975270] I 
> [master(/mnt/sda4/brick-storage-3):1164:crawl] _GMaster: slave's time: 
> (1445021993, 0)
> [2015-10-18 10:43:38.50978] E 
> [repce(/mnt/sda4/brick-storage-3):207:__call__] RepceClient: call 
> 13364:140337327839040:1445190217.96 (entry_ops) failed on peer with 
> OSError
> [2015-10-18 10:43:38.51417] E 
> [syncdutils(/mnt/sda4/brick-storage-3):276:log_raise_exception] <top>: 
> FAIL:
> Traceback (most recent call last):
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, 
> in main
>     main_i()
>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, 
> in main_i
>     local.service_loop(*[r for r in [remote] if r])
>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 
> 1438, in service_loop
>     g3.crawlwrap(oneshot=True)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 591, 
> in crawlwrap
>     self.crawl()
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 
> 1173, in crawl
>     self.changelogs_batch_process(changes)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 
> 1081, in changelogs_batch_process
>     self.process(batch)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 959, 
> in process
>     self.process_change(change, done, retry)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 914, 
> in process_change
>     failures = self.slave.server.entry_ops(entries)
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, 
> in __call__
>     return self.ins(self.meth, *a)
>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, 
> in __call__
>     raise res
> OSError: [Errno 16] Device or resource busy
> [2015-10-18 10:43:38.53669] I 
> [syncdutils(/mnt/sda4/brick-storage-3):220:finalize] <top>: exiting.
> [2015-10-18 10:43:38.56446] I [repce(agent):92:service_loop] 
> RepceServer: terminating on reaching EOF.
> [2015-10-18 10:43:38.56844] I [syncdutils(agent):220:finalize] <top>: 
> exiting.
>
> Any suggestions would be greatly appreciated, thanks.
>
> With regards to the whole community,
> --
> Rodion
>
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
