[Gluster-users] Geo-replication gets faulty after first sync of files. GlusterFS 3.7.4

Родион Скрябин r0di0n.ss at gmail.com
Tue Oct 20 18:02:49 UTC 2015


I have been fighting a long war with geo-replication. The first two battles
were won:
1. geo-replication on a cluster with glusterfs-3.4.1 (CentOS 6.6, xfs:
server1-server2 => geo-replica at east-server1)
2. one volume named infra-geo geo-replicated fine on a 2nd cluster with
glusterfs-3.7.4 (CentOS 7.1, ext4: server3-server4 => geo-replica at
east-server2)

The last battle has been going on for two months now, and I need your help to win it :)

Geo-replication of the second volume, named storage-2, goes faulty right
after the initial sync of data. It manages to sync all the files (620 GB last
time) and then becomes faulty. Restarting each and every gluster service does
not change anything; the status stays the same - faulty.
This second volume is also geo-replicated on the 2nd cluster with
glusterfs-3.7.4 (CentOS 7.1, ext4: server3-server4 => geo-replica at
east-server2).
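
For reference, this is roughly how I keep polling the session status (a small
Python sketch around the gluster CLI; the volume and slave names are the ones
from my logs, adjust to taste):

import subprocess

def georep_status(master_vol="storage-2", slave="east-storage2::storage-2"):
    # Shells out to the standard CLI:
    #   gluster volume geo-replication <MASTER_VOL> <SLAVE_HOST>::<SLAVE_VOL> status detail
    cmd = ["gluster", "volume", "geo-replication", master_vol, slave,
           "status", "detail"]
    return subprocess.check_output(cmd).decode("utf-8", "replace")

if __name__ == "__main__":
    print(georep_status())   # every worker for storage-2 shows up as Faulty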

The noticeable things:
(the geo-replica was created on '2015-10-16 16:03')
1.
/var/log/glusterfs/geo-replication/storage-2/ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1%3Astorage-2.%2Fmnt%2Fsda4%2Fbrick-storage-3-changes.log-20151018
is ordinary and is 8.8 MB in size.

The current changes log, started at '2015-10-18 17:43:17.117287', is 9.7 GB:
34909550 lines of:
E [mem-pool.c:417:mem_get0]
(-->/lib64/libglusterfs.so.0(recursive_rmdir+0x192) [0x7f885735cbb2]
-->/lib64/libglusterfs.so.0(_gf_msg+0x79f) [0x7f885733f55f]
-->/lib64/libglusterfs.so.0(mem_get0+0x81) [0x7f8857371751] ) 0-mem-pool:
invalid argument [Invalid argument]
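
The file is far too big to open in an editor, so I counted the distinct
messages with a quick script (rough sketch; the path is the current,
non-rotated changes log from above):

import collections
import re

LOG = ("/var/log/glusterfs/geo-replication/storage-2/"
       "ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1"
       "%3Astorage-2.%2Fmnt%2Fsda4%2Fbrick-storage-3-changes.log")

counts = collections.Counter()
with open(LOG, "rb") as f:
    for raw in f:
        line = raw.decode("utf-8", "replace").strip()
        # Drop the leading "[timestamp] LEVEL " so identical messages collapse.
        msg = re.sub(r"^\[[^]]*\]\s+\S\s+", "", line)
        counts[msg] += 1

for msg, n in counts.most_common(5):
    print("%10d  %s" % (n, msg[:120]))

Practically all of the 34909550 lines turn out to be that one
mem-pool/recursive_rmdir message.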

Only 22 MB of it contain something more unique:
[2015-10-19 16:42:42.652110] I
[gf-changelog.c:542:gf_changelog_register_generic] 0-gfchangelog:
Registering brick: /mnt/sda4/brick-storage-3 [notify filter: 1]
[2015-10-19 16:42:44.106591] T [rpcsvc.c:2298:rpcsvc_init] 0-rpc-service:
rx pool: 8
[2015-10-19 16:42:44.106643] T [rpcsvc-auth.c:119:rpcsvc_auth_init_auth]
0-rpc-service: Authentication enabled: AUTH_GLUSTERFS
[2015-10-19 16:42:44.106648] T [rpcsvc-auth.c:119:rpcsvc_auth_init_auth]
0-rpc-service: Authentication enabled: AUTH_GLUSTERFS-v2
[2015-10-19 16:42:44.106656] T [rpcsvc-auth.c:119:rpcsvc_auth_init_auth]
0-rpc-service: Authentication enabled: AUTH_UNIX
[2015-10-19 16:42:44.106661] T [rpcsvc-auth.c:119:rpcsvc_auth_init_auth]
0-rpc-service: Authentication enabled: AUTH_NULL
[2015-10-19 16:42:44.106664] D [rpcsvc.c:2317:rpcsvc_init] 0-rpc-service:
RPC service inited.
[2015-10-19 16:42:44.106673] D [rpcsvc.c:1874:rpcsvc_program_register]
0-rpc-service: New program registered: GF-DUMP, Num: 123451501, Ver: 1,
Port: 0
[2015-10-19 16:42:44.106687] D [rpc-transport.c:288:rpc_transport_load]
0-rpc-transport: attempt to load file
/usr/lib64/glusterfs/3.7.4/rpc-transport/socket.so
[2015-10-19 16:42:44.107068] D [socket.c:3794:socket_init]
0-socket.gfchangelog: disabling nodelay
[2015-10-19 16:42:44.107076] D [socket.c:3845:socket_init]
0-socket.gfchangelog: Configued transport.tcp-user-timeout=0
[2015-10-19 16:42:44.107081] D [socket.c:3928:socket_init]
0-socket.gfchangelog: SSL support on the I/O path is NOT enabled
[2015-10-19 16:42:44.107084] D [socket.c:3931:socket_init]
0-socket.gfchangelog: SSL support for glusterd is NOT enabled
[2015-10-19 16:42:44.107093] D [socket.c:3948:socket_init]
0-socket.gfchangelog: using system polling thread
[2015-10-19 16:42:44.107159] D [rpcsvc.c:1874:rpcsvc_program_register]
0-rpc-service: New program registered: LIBGFCHANGELOG REBORP, Num:
1886350951, Ver: 1, Port: 0
[2015-10-19 16:42:44.107182] D [rpc-clnt.c:989:rpc_clnt_connection_init]
0-gfchangelog: defaulting frame-timeout to 30mins
[2015-10-19 16:42:44.107192] D [rpc-clnt.c:1003:rpc_clnt_connection_init]
0-gfchangelog: disable ping-timeout
[2015-10-19 16:42:44.107200] D [rpc-transport.c:288:rpc_transport_load]
0-rpc-transport: attempt to load file
/usr/lib64/glusterfs/3.7.4/rpc-transport/socket.so
[2015-10-19 16:42:44.107233] D [socket.c:3794:socket_init] 0-gfchangelog:
disabling nodelay
[2015-10-19 16:42:44.107238] D [socket.c:3845:socket_init] 0-gfchangelog:
Configued transport.tcp-user-timeout=0
[2015-10-19 16:42:44.107242] D [socket.c:3928:socket_init] 0-gfchangelog:
SSL support on the I/O path is NOT enabled
[2015-10-19 16:42:44.107248] D [socket.c:3931:socket_init] 0-gfchangelog:
SSL support for glusterd is NOT enabled
[2015-10-19 16:42:44.107255] D [socket.c:3948:socket_init] 0-gfchangelog:
using system polling thread
[2015-10-19 16:42:44.107263] T [rpc-clnt.c:418:rpc_clnt_reconnect]
0-gfchangelog: attempting reconnect
[2015-10-19 16:42:44.107267] T [socket.c:2887:socket_connect]
0-gfchangelog: connecting 0x7f2dbc08c7e0, state=0 gen=0 sock=-1
[2015-10-19 16:42:44.107272] T
[name.c:295:af_unix_client_get_remote_sockaddr] 0-gfchangelog: using
connect-path
/var/run/gluster/changelog-535626c1c4cea4957a54c920f955b1ac.sock
[2015-10-19 16:42:44.107284] T [name.c:111:af_unix_client_bind]
0-gfchangelog: bind-path not specified for unix socket, letting connect to
assign default value
[2015-10-19 16:42:46.107494] T [rpc-clnt.c:1406:rpc_clnt_record]
0-gfchangelog: Auth Info: pid: 0, uid: 0, gid: 0, owner:
[2015-10-19 16:42:46.107543] T
[rpc-clnt.c:1263:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen
500, payload: 436, rpc hdr: 64
[2015-10-19 16:42:46.107589] T [rpc-clnt.c:1600:rpc_clnt_submit]
0-rpc-clnt: submitted request (XID: 0x1 Program: LIBGFCHANGELOG, ProgVers:
1, Proc: 1) to rpc-transport (gfchangelog)
[2015-10-19 16:42:46.107618] D [rpc-clnt-ping.c:281:rpc_clnt_start_ping]
0-gfchangelog: ping timeout is 0, returning
[2015-10-19 16:42:46.109489] T [rpc-clnt.c:663:rpc_clnt_reply_init]
0-gfchangelog: received rpc message (RPC XID: 0x1 Program: LIBGFCHANGELOG,
ProgVers: 1, Proc: 1) from rpc-transport (gfchangelog)
[2015-10-19 16:42:46.146891] I
[gf-history-changelog.c:760:gf_changelog_extract_min_max] 0-gfchangelog:
MIN: 1439390689, MAX: 1445272962, TOTAL CHANGELOGS: 391739
[2015-10-19 16:42:46.147039] I
[gf-history-changelog.c:907:gf_history_changelog] 0-gfchangelog: FINAL:
from: 1445023075, to: 1445272962, changes: 16643
[2015-10-19 16:42:46.148604] D
[gf-history-changelog.c:298:gf_history_changelog_scan] 0-gfchangelog:
hist_done 1, is_last_scan: 0




2.
 /var/log/glusterfs/geo-replication/storage-2/ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1%3Astorage-2.log
[2015-10-18 10:42:44.890366] W
[master(/mnt/sda4/brick-storage-3):1041:process] _GMaster: incomplete sync,
retrying changelogs: CHANGELOG.1445022009 CHANGELOG.1445022024 CHANGELOG.1445022039 CHANGELOG.1445022054
CHANGELOG.1445022069 CHANGELOG.1445022084 CHANGELOG.1445022099 CHANGELOG.1445022114
CHANGELOG.1445022129 CHANGELOG.1445022144 CHANGELOG.1445022159 CHANGELOG.1445022174
CHANGELOG.1445022189 CHANGELOG.1445022204 CHANGELOG.1445022219 CHANGELOG.1445022234
CHANGELOG.1445022249 CHANGELOG.1445022264 CHANGELOG.1445022279 CHANGELOG.1445022294
CHANGELOG.1445022309 CHANGELOG.1445022324 CHANGELOG.1445022339 CHANGELOG.1445022354
CHANGELOG.1445022369 CHANGELOG.1445022384 CHANGELOG.1445022399 CHANGELOG.1445022414
CHANGELOG.1445022429 CHANGELOG.1445022444 CHANGELOG.1445022459 CHANGELOG.1445022474
CHANGELOG.1445022489 CHANGELOG.1445022504 CHANGELOG.1445022519 CHANGELOG.1445022534
CHANGELOG.1445022549 CHANGELOG.1445022564 CHANGELOG.1445022579 CHANGELOG.1445022594
CHANGELOG.1445022609 CHANGELOG.1445022624 CHANGELOG.1445022639 CHANGELOG.1445022654
CHANGELOG.1445022669 CHANGELOG.1445022684 CHANGELOG.1445022699 CHANGELOG.1445022714
CHANGELOG.1445022729 CHANGELOG.1445022744 CHANGELOG.1445022759 CHANGELOG.1445022774
CHANGELOG.1445022789 CHANGELOG.1445022805 CHANGELOG.1445022820 CHANGELOG.1445022835
CHANGELOG.1445022850 CHANGELOG.1445022865 CHANGELOG.1445022880 CHANGELOG.1445022895
CHANGELOG.1445022910 CHANGELOG.1445022925 CHANGELOG.1445022940 CHANGELOG.1445022955
CHANGELOG.1445022970 CHANGELOG.1445022985 CHANGELOG.1445023000 CHANGELOG.1445023015
CHANGELOG.1445023030 CHANGELOG.1445023045 CHANGELOG.1445023060 CHANGELOG.1445023075
CHANGELOG.1445023090 CHANGELOG.1445023105 CHANGELOG.1445023120 CHANGELOG.1445023135
CHANGELOG.1445023150 CHANGELOG.1445023165 CHANGELOG.1445023180 CHANGELOG.1445023195
CHANGELOG.1445023210 CHANGELOG.1445023225 CHANGELOG.1445023240 CHANGELOG.1445023255
CHANGELOG.1445023270 CHANGELOG.1445023285 CHANGELOG.1445023300 CHANGELOG.1445023315
CHANGELOG.1445023330 CHANGELOG.1445023345 CHANGELOG.1445023360 CHANGELOG.1445023375
CHANGELOG.1445023390 CHANGELOG.1445023405 CHANGELOG.1445023420 CHANGELOG.1445023435
CHANGELOG.1445023450 CHANGELOG.1445023465 CHANGELOG.1445023480 CHANGELOG.1445023495
CHANGELOG.1445023510 CHANGELOG.1445023525 CHANGELOG.1445023540 CHANGELOG.1445023555
CHANGELOG.1445023570 CHANGELOG.1445023585 CHANGELOG.1445023600 CHANGELOG.1445023615
CHANGELOG.1445023630 CHANGELOG.1445023645 CHANGELOG.1445023660 CHANGELOG.1445023675
CHANGELOG.1445023690 CHANGELOG.1445023705 CHANGELOG.1445023720 CHANGELOG.1445023735
CHANGELOG.1445023750 CHANGELOG.1445023765 CHANGELOG.1445023780 CHANGELOG.1445023796
CHANGELOG.1445023811 CHANGELOG.1445023826 CHANGELOG.1445023841 CHANGELOG.1445023856
CHANGELOG.1445023871 CHANGELOG.1445023886 CHANGELOG.1445023901 CHANGELOG.1445023916
CHANGELOG.1445023931 CHANGELOG.1445023946 CHANGELOG.1445023961 CHANGELOG.1445023976
CHANGELOG.1445023991 CHANGELOG.1445024006 CHANGELOG.1445024021 CHANGELOG.1445024036
CHANGELOG.1445024051 CHANGELOG.1445024066 CHANGELOG.1445024081 CHANGELOG.1445024096
CHANGELOG.1445024111 CHANGELOG.1445024126 CHANGELOG.1445024141 CHANGELOG.1445024156
CHANGELOG.1445024171 CHANGELOG.1445024186 CHANGELOG.1445024201 CHANGELOG.1445024216
CHANGELOG.1445024231 CHANGELOG.1445024246 CHANGELOG.1445024261 CHANGELOG.1445024276
CHANGELOG.1445024291 CHANGELOG.1445024306 CHANGELOG.1445024321 CHANGELOG.1445024336
CHANGELOG.1445024351 CHANGELOG.1445024366 CHANGELOG.1445024381 CHANGELOG.1445024396
CHANGELOG.1445024411
[2015-10-18 10:43:02.808736] E
[repce(/mnt/sda4/brick-storage-3):207:__call__] RepceClient: call
24010:140490734180160:1445190182.72 (entry_ops) failed on peer with OSError
[2015-10-18 10:43:02.809122] E
[syncdutils(/mnt/sda4/brick-storage-3):276:log_raise_exception] <top>:
FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in
main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in
main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1451,
in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 591, in
crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1106, in
crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1081, in
changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 959, in
process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 914, in
process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in
__call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in
__call__
    raise res
OSError: [Errno 16] Device or resource busy
[2015-10-18 10:43:02.811510] I
[syncdutils(/mnt/sda4/brick-storage-3):220:finalize] <top>: exiting.
[2015-10-18 10:43:02.819149] I [repce(agent):92:service_loop] RepceServer:
terminating on reaching EOF.
[2015-10-18 10:43:02.819616] I [syncdutils(agent):220:finalize] <top>:
exiting.
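
As far as I understand repce.py, the OSError is not raised on the master at
all: entry_ops fails on the slave, the exception object is shipped back over
the RPC channel and re-raised locally, which is the final "raise res" in the
traceback. A tiny, simplified illustration of that pattern (hypothetical
names, not the real repce API):

import errno
import pickle

def slave_entry_ops(entries):
    # The slave-side server catches the failure and returns the exception
    # object as the call result instead of raising it.
    return OSError(errno.EBUSY, "Device or resource busy")

def master_call(method, *args):
    # "RPC": serialize the reply, deserialize it on the master side and
    # re-raise it if it turns out to be an exception -- the "raise res" idea.
    res = pickle.loads(pickle.dumps(method(*args)))
    if isinstance(res, Exception):
        raise res
    return res

if __name__ == "__main__":
    master_call(slave_entry_ops, [])  # OSError: [Errno 16] Device or resource busy

So I assume the EBUSY has to be hunted down on the slave side, not on the
masters.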

Since then I keep getting this repeating error:
[2015-10-18 10:43:13.73946] I [monitor(monitor):221:monitor] Monitor:
------------------------------------------------------------
[2015-10-18 10:43:13.74423] I [monitor(monitor):222:monitor] Monitor:
starting gsyncd worker
[2015-10-18 10:43:13.177963] I [changelogagent(agent):75:__init__]
ChangelogAgent: Agent listining...
[2015-10-18 10:43:13.178881] I
[gsyncd(/mnt/sda4/brick-storage-3):649:main_i] <top>: syncing:
gluster://localhost:storage-2 -> ssh://root@east-storage2:gluster://localhost:storage-2
[2015-10-18 10:43:17.93168] I
[master(/mnt/sda4/brick-storage-3):83:gmaster_builder] <top>: setting up
xsync change detection mode
[2015-10-18 10:43:17.93669] I
[master(/mnt/sda4/brick-storage-3):404:__init__] _GMaster: using 'rsync' as
the sync engine
[2015-10-18 10:43:17.95079] I
[master(/mnt/sda4/brick-storage-3):83:gmaster_builder] <top>: setting up
changelog change detection mode
[2015-10-18 10:43:17.95390] I
[master(/mnt/sda4/brick-storage-3):404:__init__] _GMaster: using 'rsync' as
the sync engine
[2015-10-18 10:43:17.96730] I
[master(/mnt/sda4/brick-storage-3):83:gmaster_builder] <top>: setting up
changeloghistory change detection mode
[2015-10-18 10:43:17.97048] I
[master(/mnt/sda4/brick-storage-3):404:__init__] _GMaster: using 'rsync' as
the sync engine
[2015-10-18 10:43:19.915657] I
[master(/mnt/sda4/brick-storage-3):1220:register] _GMaster: xsync temp
directory:
/var/lib/misc/glusterfsd/storage-2/ssh%3A%2F%2Froot%40172.16.10.43%3Agluster%3A%2F%2F127.0.0.1%3Astorage-2/535626c1c4cea4957a54c920f955b1ac/xsync
[2015-10-18 10:43:19.916011] I
[resource(/mnt/sda4/brick-storage-3):1432:service_loop] GLUSTER: Register
time: 1445190199
[2015-10-18 10:43:19.935817] I
[master(/mnt/sda4/brick-storage-3):530:crawlwrap] _GMaster: primary master
with volume id 054d6225-39fc-40f5-9604-5ee4b6dcd8b4 ...
[2015-10-18 10:43:19.938358] I
[master(/mnt/sda4/brick-storage-3):539:crawlwrap] _GMaster: crawl interval:
1 seconds
[2015-10-18 10:43:19.942023] I
[master(/mnt/sda4/brick-storage-3):1135:crawl] _GMaster: starting history
crawl... turns: 1, stime: (1445021993, 0), etime: 1445190199
[2015-10-18 10:43:20.975270] I
[master(/mnt/sda4/brick-storage-3):1164:crawl] _GMaster: slave's time:
(1445021993, 0)
[2015-10-18 10:43:38.50978] E
[repce(/mnt/sda4/brick-storage-3):207:__call__] RepceClient: call
13364:140337327839040:1445190217.96 (entry_ops) failed on peer with OSError
[2015-10-18 10:43:38.51417] E
[syncdutils(/mnt/sda4/brick-storage-3):276:log_raise_exception] <top>:
FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in
main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in
main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1438,
in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 591, in
crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1173, in
crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1081, in
changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 959, in
process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 914, in
process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in
__call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in
__call__
    raise res
OSError: [Errno 16] Device or resource busy
[2015-10-18 10:43:38.53669] I
[syncdutils(/mnt/sda4/brick-storage-3):220:finalize] <top>: exiting.
[2015-10-18 10:43:38.56446] I [repce(agent):92:service_loop] RepceServer:
terminating on reaching EOF.
[2015-10-18 10:43:38.56844] I [syncdutils(agent):220:finalize] <top>:
exiting.
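
Because every restarted worker begins the history crawl from the very same
stime (1445021993, 0) and then dies on the same changelog batch, the sync
point never advances. To double-check that on the master brick I dump the
stime-related xattrs from the brick root (Python 3 sketch, run as root; the
exact attribute name is only my guess, so the script just looks for anything
containing "stime" and prints the raw value in hex):

import binascii
import os

BRICK = "/mnt/sda4/brick-storage-3"

for name in os.listxattr(BRICK):
    if "stime" in name:
        value = os.getxattr(BRICK, name)
        print(name, binascii.hexlify(value).decode())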

Any suggestions would be greatly appreciated, thanks.

With regards to the whole community,
--
Rodion

