[Gluster-users] [Centos7.1x64 / GlusterFS 3.6.3] Geo-replication faulty state

wodel youchi wodel.youchi at gmail.com
Mon May 4 11:02:57 UTC 2015


Hi,

Centos7.1 x64 updated

GlusterFS 3.6.3 latest

[root@glustersrv1 ~]# rpm -qa | grep gluster
glusterfs-libs-3.6.3-1.el7.x86_64
glusterfs-cli-3.6.3-1.el7.x86_64
glusterfs-api-3.6.3-1.el7.x86_64
glusterfs-server-3.6.3-1.el7.x86_64
glusterfs-3.6.3-1.el7.x86_64
glusterfs-geo-replication-3.6.3-1.el7.x86_64
glusterfs-fuse-3.6.3-1.el7.x86_64

I have two GlusterFS nodes hosting two volumes in replicated mode, plus a
third server acting as the geo-replication slave.

I recreated the geo-replication sessions for both volumes yesterday, and
this morning I found one of them in a faulty state.
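For reference, the sessions were recreated roughly like this (shown for the
data1 volume; the slave host/user names below match the logs, but treat the
exact invocation as an approximation, not a transcript):

```shell
# Recreate and start the geo-replication session for data1
# (non-root slave user geoaccount1 on glustersrv3, slave volume slavedata1).
gluster volume geo-replication data1 geoaccount1@glustersrv3.example.com::slavedata1 create push-pem
gluster volume geo-replication data1 geoaccount1@glustersrv3.example.com::slavedata1 start

# Check the session state (this is where "faulty" shows up):
gluster volume geo-replication data1 geoaccount1@glustersrv3.example.com::slavedata1 status
```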

My workload consists mostly of small files.

These are my volume options on all three nodes:
performance.cache-size 1GB
performance.write-behind-window-size 8MB
performance.io-thread-count 64
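(For completeness, these were applied with `gluster volume set`; shown here
for data1 only, the same was done for the second volume:)

```shell
# Tuning for a small-file workload, applied per volume.
gluster volume set data1 performance.cache-size 1GB
gluster volume set data1 performance.write-behind-window-size 8MB
gluster volume set data1 performance.io-thread-count 64

# "gluster volume info data1" lists the reconfigured options for verification.
gluster volume info data1
```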

These are my OS parameters on all three nodes:
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
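(Applied at runtime with sysctl and persisted in a sysctl.d drop-in; the
file name below is just the one I chose:)

```shell
# Lower the dirty-page thresholds so writeback starts earlier
# (helps smooth out bursty small-file write workloads).
sysctl -w vm.dirty_ratio=5
sysctl -w vm.dirty_background_ratio=2

# Persisted in /etc/sysctl.d/99-gluster.conf:
#   vm.dirty_ratio = 5
#   vm.dirty_background_ratio = 2
sysctl --system   # reload all sysctl configuration files
```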


And these are my logs from the faulty geo-replication session on the master

vim
/var/log/glusterfs/geo-replication/data1/ssh%3A%2F%2Fgeoaccount1%40192.168.0.10%3Agluster%3A%2F%2F127.0.0.1%3Aslavedata1.log
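(Side note for anyone following along: these log file names are just the
geo-replication session URL, percent-encoded; they can be decoded with the
Python standard library:)

```python
from urllib.parse import unquote  # Python 3 standard library

# Geo-replication log names are the session URL, percent-encoded
# (%3A = ':', %2F = '/', %40 = '@').
name = "ssh%3A%2F%2Fgeoaccount1%40192.168.0.10%3Agluster%3A%2F%2F127.0.0.1%3Aslavedata1.log"
print(unquote(name))
# ssh://geoaccount1@192.168.0.10:gluster://127.0.0.1:slavedata1.log
```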

[2015-05-03 18:11:47.381232] I [master(/mnt/brick1/brick):501:crawlwrap]
_GMaster: crawl interval: 60 seconds
[2015-05-03 18:11:47.412904] I [master(/mnt/brick1/brick):1333:crawl]
_GMaster: starting hybrid crawl..., stime: (-1, 0)
[2015-05-03 18:11:49.426054] I [master(/mnt/brick1/brick):1343:crawl]
_GMaster: processing xsync changelog
/var/lib/misc/glusterfsd/data1/ssh%3A%2F%2Fgeoaccount1%40192.168.0.10%3Agluster%3A%2F%2F127.0.0.1%3Aslavedata1/0a3f2e04379aff2570c88fd97292612c/xsync/XSYNC-CHANGELOG.1430673107
[2015-05-03 18:12:45.325345] I [monitor(monitor):141:set_state] Monitor: new
state: Stable
.....
[2015-05-03 19:36:08.606129] I [master(/mnt/brick1/brick):1343:crawl]
_GMaster: processing xsync changelog
/var/lib/misc/glusterfsd/data1/ssh%3A%2F%2Fgeoaccount1%40192.168.0.10%3Agluster%3A%2F%2F127.0.0.1%3Aslavedata1/0a3f2e04379aff2570c88fd97292612c/xsync/XSYNC-CHANGELOG.1430673119
[2015-05-03 19:37:53.341598] I [master(/mnt/brick1/brick):1343:crawl]
_GMaster: processing xsync changelog
/var/lib/misc/glusterfsd/data1/ssh%3A%2F%2Fgeoaccount1%40192.168.0.10%3Agluster%3A%2F%2F127.0.0.1%3Aslavedata1/0a3f2e04379aff2570c88fd97292612c/xsync/XSYNC-CHANGELOG.1430673122
[2015-05-03 19:41:21.495309] I [master(/mnt/brick1/brick):1340:crawl]
_GMaster: finished hybrid crawl syncing, stime: (-1, 0)
[2015-05-03 19:41:21.543339] I [master(/mnt/brick1/brick):490:crawlwrap]
_GMaster: primary master with volume id
967ddac3-af34-4c70-8d2b-eb201ebb645d ...
[2015-05-03 19:41:21.630045] I [master(/mnt/brick1/brick):501:crawlwrap]
_GMaster: crawl interval: 3 seconds
[2015-05-03 19:41:21.679512] I [master(/mnt/brick1/brick):1192:crawl]
_GMaster: slave's time: (-1, 0)
[2015-05-03 20:59:15.852770] I [master(/mnt/brick1/brick):514:crawlwrap]
_GMaster: 1 crawls, 371 turns
[2015-05-03 20:59:15.898876] I [master(/mnt/brick1/brick):1192:crawl]
_GMaster: slave's time: (1430678475, 0)
[2015-05-03 22:41:14.966464] I [master(/mnt/brick1/brick):514:crawlwrap]
_GMaster: 1 crawls, 100 turns
[2015-05-03 22:41:15.18826] I [master(/mnt/brick1/brick):1192:crawl]
_GMaster: slave's time: (1430682429, 0)
[2015-05-03 22:41:17.679006] E [resource(/mnt/brick1/brick):221:errlog]
Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no
-i /var/lib/glusterd/geo-replication/tar_ssh.pem
geoaccount1@glustersrv3.example.com tar --overwrite -xf - -C
/proc/6769/cwd" returned with 2, saying:
[2015-05-03 22:41:17.679337] E [resource(/mnt/brick1/brick):225:logerr]
Popen: ssh> tar: .gfid/1dabd50c-fc18-400a-b961-61b1f12ffbcc: Cannot open:
Operation not permitted
[2015-05-03 22:41:17.679510] E [resource(/mnt/brick1/brick):225:logerr]
Popen: ssh> tar: .gfid/e4fd508a-986d-4eac-82c4-e53503d311e3: Cannot open:
Operation not permitted
[2015-05-03 22:41:17.679676] E [resource(/mnt/brick1/brick):225:logerr]
Popen: ssh> tar: Exiting with failure status due to previous errors
[2015-05-03 22:41:17.687088] I [syncdutils(/mnt/brick1/brick):214:finalize]
<top>: exiting.
[2015-05-03 22:41:18.443120] I [repce(agent):92:service_loop] RepceServer:
terminating on reaching EOF.
[2015-05-03 22:41:18.608139] I [syncdutils(agent):214:finalize] <top>:
exiting.
[2015-05-03 22:41:19.79348] I [monitor(monitor):141:set_state] Monitor:
new state: faulty
[2015-05-03 22:41:30.583766] I [monitor(monitor):215:monitor] Monitor:
------------------------------------------------------------
[2015-05-03 22:41:30.584100] I [monitor(monitor):216:monitor] Monitor:
starting gsyncd worker
[2015-05-03 22:41:30.697026] I [gsyncd(/mnt/brick1/brick):633:main_i]
<top>: syncing: gluster://localhost:data1 ->
ssh://geoaccount1@glustersrv3.example.com:gluster://localhost:slavedata1
[2015-05-03 22:41:30.697663] I [changelogagent(agent):72:__init__]
ChangelogAgent: Agent listining...
[2015-05-03 22:41:34.274518] I
[master(/mnt/brick1/brick):75:gmaster_builder] <top>: setting up xsync
change detection mode
[2015-05-03 22:41:34.274899] I [master(/mnt/brick1/brick):410:__init__]
_GMaster: using 'tar over ssh' as the sync engine
[2015-05-03 22:41:34.276175] I
[master(/mnt/brick1/brick):75:gmaster_builder] <top>: setting up changelog
change detection mode
[2015-05-03 22:41:34.276448] I [master(/mnt/brick1/brick):410:__init__]
_GMaster: using 'tar over ssh' as the sync engine
[2015-05-03 22:41:34.278003] I
[master(/mnt/brick1/brick):75:gmaster_builder] <top>: setting up
changeloghistory change detection mode
[2015-05-03 22:41:34.278273] I [master(/mnt/brick1/brick):410:__init__]
_GMaster: using 'tar over ssh' as the sync engine
[2015-05-03 22:41:34.310321] I [master(/mnt/brick1/brick):1311:register]
_GMaster: xsync temp directory:
/var/lib/misc/glusterfsd/data1/ssh%3A%2F%2Fgeoaccount1%40192.168.0.10%3Agluster%3A%2F%2F127.0.0.1%3Aslavedata1/0a3f2e04379aff2570c88fd97292612c/xsync
[2015-05-03 22:41:34.310627] I
[resource(/mnt/brick1/brick):1318:service_loop] GLUSTER: Register time:
1430689294
[2015-05-03 22:41:34.334717] I [master(/mnt/brick1/brick):490:crawlwrap]
_GMaster: primary master with volume id
967ddac3-af34-4c70-8d2b-eb201ebb645d ...
[2015-05-03 22:41:34.358408] I [master(/mnt/brick1/brick):501:crawlwrap]
_GMaster: crawl interval: 1 seconds

[2015-05-03 22:41:34.395153] I [master(/mnt/brick1/brick):1226:crawl]
_GMaster: starting history crawl... turns: 1, stime: (1430682429, 0)
[2015-05-03 22:41:34.413995] E [repce(agent):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 51, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 94, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 61] No data available
[2015-05-03 22:41:34.427406] E [repce(/mnt/brick1/brick):207:__call__]
RepceClient: call 13302:139956335093568:1430689294.4 (history) failed on
peer with ChangelogException
[2015-05-03 22:41:34.427608] I
[resource(/mnt/brick1/brick):1332:service_loop] GLUSTER: Changelog history
crawl failed, fallback to xsync: 61 - No data available
[2015-05-03 22:41:34.430480] I [master(/mnt/brick1/brick):490:crawlwrap]
_GMaster: primary master with volume id
967ddac3-af34-4c70-8d2b-eb201ebb645d ...
[2015-05-03 22:41:34.453098] I [master(/mnt/brick1/brick):501:crawlwrap]
_GMaster: crawl interval: 60 seconds
[2015-05-03 22:41:34.485538] I [master(/mnt/brick1/brick):1333:crawl]
_GMaster: starting hybrid crawl..., stime: (1430682429, 0)
[2015-05-03 22:41:34.496975] I [master(/mnt/brick1/brick):1343:crawl]
_GMaster: processing xsync changelog
/var/lib/misc/glusterfsd/data1/ssh%3A%2F%2Fgeoaccount1%40192.168.0.10%3Agluster%3A%2F%2F127.0.0.1%3Aslavedata1/0a3f2e04379aff2570c88fd97292612c/xsync/XSYNC-CHANGELOG.1430689294
[2015-05-03 22:41:34.521049] I [master(/mnt/brick1/brick):1340:crawl]
_GMaster: finished hybrid crawl syncing, stime: (1430682429, 0)
[2015-05-03 22:41:34.523505] I [master(/mnt/brick1/brick):490:crawlwrap]
_GMaster: primary master with volume id
967ddac3-af34-4c70-8d2b-eb201ebb645d ...
[2015-05-03 22:41:34.549760] I [master(/mnt/brick1/brick):501:crawlwrap]
_GMaster: crawl interval: 3 seconds
[2015-05-03 22:41:34.594702] I [master(/mnt/brick1/brick):1192:crawl]
_GMaster: slave's time: (1430682429, 0)
------------------------------

vim
/var/log/glusterfs/geo-replication/data1/ssh%3A%2F%2Fgeoaccount1%40192.168.0.10%3Agluster%3A%2F%2F127.0.0.1%3Aslavedata1.%2Fmnt%2Fbrick1%2Fbrick-changes.log


[2015-05-03 21:41:34.408090] I
[gf-history-changelog.c:723:gf_changelog_extract_min_max] 0-glusterfs: MIN:
1424167148, MAX: 1424177370, TOTAL CHANGELOGS: 681
[2015-05-03 21:41:34.408144] I
[gf-history-changelog.c:723:gf_changelog_extract_min_max] 0-glusterfs: MIN:
1424186103, MAX: 1424188630, TOTAL CHANGELOGS: 169
[2015-05-03 21:41:34.413823] E
[gf-history-changelog.c:716:gf_changelog_extract_min_max] 0-glusterfs:
error extracting max timstamp from htime file
/mnt/brick1/brick/.glusterfs/changelogs/htime/HTIME.1424189085 (reason No
data available)
[2015-05-03 21:41:49.220713] I
[gf-changelog.c:186:gf_changelog_notification_init] 0-glusterfs: connecting
to changelog socket:
/var/run/gluster/changelog-0a3f2e04379aff2570c88fd97292612c.sock (brick:
/mnt/brick1/brick)
[2015-05-03 21:41:49.220744] W
[gf-changelog.c:196:gf_changelog_notification_init] 0-glusterfs: connection
attempt 1/5...
[2015-05-03 21:41:49.220811] I
[gf-changelog.c:216:gf_changelog_notification_init] 0-glusterfs: connection
successful
[2015-05-03 21:41:49.307287] I
[gf-history-changelog.c:723:gf_changelog_extract_min_max] 0-glusterfs: MIN:
1420460614, MAX: 1423403022, TOTAL CHANGELOGS: 195753
[2015-05-03 21:41:49.307321] I
[gf-history-changelog.c:723:gf_changelog_extract_min_max] 0-glusterfs: MIN:
1423403644, MAX: 1423406321, TOTAL CHANGELOGS: 179
[2015-05-03 21:41:49.307348] I
[gf-history-changelog.c:723:gf_changelog_extract_min_max] 0-glusterfs: MIN:
1423406339, MAX: 1423406475, TOTAL CHANGELOGS: 10
[2015-05-03 21:41:49.307365] I
[gf-history-changelog.c:723:gf_changelog_extract_min_max] 0-glusterfs: MIN:
1423412231, MAX: 1424166900, TOTAL CHANGELOGS: 50197
[2015-05-03 21:41:49.307379] I
[gf-history-changelog.c:723:gf_changelog_extract_min_max] 0-glusterfs: MIN:
1424167148, MAX: 1424177370, TOTAL CHANGELOGS: 681
[2015-05-03 21:41:49.307394] I
[gf-history-changelog.c:723:gf_changelog_extract_min_max] 0-glusterfs: MIN:
1424186103, MAX: 1424188630, TOTAL CHANGELOGS: 169
[2015-05-03 21:41:49.307474] E
[gf-history-changelog.c:716:gf_changelog_extract_min_max] 0-glusterfs:
error extracting max timstamp from htime file
/mnt/brick1/brick/.glusterfs/changelogs/htime/HTIME.1424189085 (reason No
data available)
[2015-05-03 21:42:03.162448] I
[gf-changelog.c:186:gf_changelog_notification_init] 0-glusterfs: connecting
to changelog socket:
/var/run/gluster/changelog-0a3f2e04379aff2570c88fd97292612c.sock (brick:
/mnt/brick1/brick)


------------------------------------

vim
/var/log/glusterfs/geo-replication/data1/ssh%3A%2F%2Fgeoaccount1%40192.168.0.10%3Agluster%3A%2F%2F127.0.0.1%3Aslavedata1.%2Fmnt%2Fbrick1%2Fbrick.gluster.log

[2015-05-03 21:41:18.154462] I [fuse-bridge.c:4921:fuse_thread_proc]
0-fuse: unmounting /tmp/gsyncd-aux-mount-eZkdYQ
[2015-05-03 21:41:18.223625] W [glusterfsd.c:1194:cleanup_and_exit] (-->
0-: received signum (15), shutting down
[2015-05-03 21:41:18.223666] I [fuse-bridge.c:5599:fini] 0-fuse: Unmounting
'/tmp/gsyncd-aux-mount-eZkdYQ'.
[2015-05-03 21:41:33.088542] I [MSGID: 100030] [glusterfsd.c:2018:main]
0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.3
(args: /usr/sbin/glusterfs --aux-gfid-mount
--log-file=/var/log/glusterfs/geo-replication/data1/ssh%3A%2F%2Fgeoaccount1%40192.168.0.10%3Agluster%3A%2F%2F127.0.0.1%3Aslavedata1.%2Fmnt%2Fbrick1%2Fbrick.gluster.log
--volfile-server=localhost --volfile-id=data1 --client-pid=-1
/tmp/gsyncd-aux-mount-WPP5HT)
[2015-05-03 21:41:33.224751] I [dht-shared.c:337:dht_init_regex]
0-data1-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$
[2015-05-03 21:41:33.228340] I [client.c:2280:notify] 0-data1-client-0:
parent translators are ready, attempting connect on transport
[2015-05-03 21:41:33.233248] I [client.c:2280:notify] 0-data1-client-1:
parent translators are ready, attempting connect on transport
Final graph:

Any ideas?


Thanks in advance.