[Gluster-users] glusterfs geo-replication

Wed Sep 30 15:21:38 UTC 2015

Hi,

I am still getting the same issue, and looking further into this, I have
modified syncdutils.py slightly on the slave node to print more information
on the nature of this OSError:

[2015-09-30 12:19:25.8531] I [gsyncd(slave):649:main_i] <top>: syncing:
gluster://localhost:homegs
[2015-09-30 12:19:26.79925] I [resource(slave):844:service_loop] GLUSTER:
slave listening
[2015-09-30 12:19:30.723542] W [syncdutils(slave):481:errno_wrap] <top>:
about to raise exception for <built-in function
rename>(['.gfid/a780729e-3602-4c30-9966-f31fa5964843/CMakeCache.txt.tmp',
'.gfid/a780729e-3602-4c30-9966-f31fa5964843/CMakeCache.txt']) from
/...[Errno 16] Device or resource busy
[2015-09-30 12:19:30.723845] I [syncdutils(slave):483:errno_wrap] <top>: we
could try deleting
'.glusterfs/a7/80/a780729e-3602-4c30-9966-f31fa5964843/CMakeCache.txt.tmp'
[2015-09-30 12:19:30.723974] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in
worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 725, in
entry_ops
    [ENOENT, EEXIST], [ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475,
in errno_wrap
    return call(*arg)
OSError: [Errno 16] Device or resource busy
[2015-09-30 12:19:30.770679] I [repce(slave):92:service_loop] RepceServer:
terminating on reaching EOF.
[2015-09-30 12:19:30.770944] I [syncdutils(slave):220:finalize] <top>:
exiting.

Below is a list of errors:

W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for
<built-in function
rename>(['.gfid/93b6282a-a47f-40fb-b83e-efec5b9c1a4e/.viminfo.tmp',
'.gfid/93b6282a-a47f-40fb-b83e-efec5b9c1a4e/.viminfo']) from /...[Errno 16]
Device or resource busy
W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for
<built-in function
rename>(['.gfid/9c4f77d0-0855-4607-8d65-1a52b3cdbbce/gr_rsync_homes_new.sh',
'.gfid/9c4f77d0-0855-4607-8d65-1a52b3cdbbce/gr_rsync_homes_new.sh~']) from
/...[Errno 16] Device or resource busy
W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for
<built-in function
rename>(['.gfid/a780729e-3602-4c30-9966-f31fa5964843/CMakeCache.txt.tmp',
'.gfid/a780729e-3602-4c30-9966-f31fa5964843/CMakeCache.txt']) from
/...[Errno 16] Device or resource busy
W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for
<built-in function
rename>(['.gfid/a78e4abc-1de4-4f0d-a5e2-1244a0e3b708/depend.internal.tmp',
'.gfid/a78e4abc-1de4-4f0d-a5e2-1244a0e3b708/depend.internal']) from
/...[Errno 16] Device or resource busy
W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for
<built-in function rename>(['.gfid/f5491794-32e2-46f4-8c50-7ca31e34111f/
gluster-volume-info.debug.pl', '.gfid/f5491794-32e2-46f4-8c50-7ca31e34111f/
gluster-volume-info.debug.pl~'])...[Errno 16] Device or resource busy
W [syncdutils(slave):481:errno_wrap] <top>: about to raise exception for
<built-in function rename>(['.gfid/f5491794-32e2-46f4-8c50-7ca31e34111f/
gluster-volume-info.debug.pl', '.gfid/f5491794-32e2-46f4-8c50-7ca31e34111f/
gluster-volume-info.debug.pl~']) from /...[Errno 16] Device or resource busy

I am not entirely clear why rename() would return this ‘Device or resource
busy’.  Also I’m not sure what are these ‘.gfid’ paths exactly: is this
something transient? Where does it live exactly?

I tried deleting the affected files (e.g.
a78e4abc-1de4-4f0d-a5e2-1244a0e3b708/depend.internal.tmp) in .glusterfs/,
and so far that allowed the geo-replication to resume… until it hits
another one of these.

In some cases I found that this gfid points to a directory that doesn’t
exist, e.g.:

ls-alh
/gluster/homegs-brick-1/brick/.glusterfs/a7/8e/a78e4abc-1de4-4f0d-a5e2-1244a0e3b708

lrwxrwxrwx 1 root root 71 Sep 30 11:26
/gluster/homegs-brick-1/brick/.glusterfs/a7/8e/a78e4abc-1de4-4f0d-a5e2-1244a0e3b708
-> ../../0d/be/0dbe0730-5ba9-45d5-91e5-aeb069ab08bf/watchdog_unit_test.dir

but the link is broken.

In that case deleting the
/gluster/homegs-brick-1/brick/.glusterfs/a7/8e/a78e4abc-1de4-4f0d-a5e2-1244a0e3b708
allowed the geo-replication to carry on.

Obviously it doesn’t work if I have to do this 50 times a day…

Any idea what could cause this and how to fix it properly?

Thx,
Thib.
On 26 Aug 2015 12:58 pm, "Thibault Godouet" <tibo92 at godouet.net> wrote:

> Hi,
>
> Given that this is an 'OSError', I should probably have said we use Redhat
> 6.6 64bit and XFS as the brick filesystem.
>
> Has anyone any input on how to troubleshoot this?
>
> Thanks,
> Thibault.
> On 24 Aug 2015 4:03 pm, "Thibault Godouet" <
> Thibault.Godouet at gresearch.co.uk> wrote:
>
>> I have had multiple issues with geo-replication.  It seems to work OK
>> initially, the replica gets up to date, and not long after (e.g. a couple
>> of days), the replication goes into a faulty state and won't get out of it.
>>
>>
>> I have tried a few times now, and last attempt I re-created the slave
>> volume and setup the replication again.  Same symptoms again.
>>
>>
>>
>> I use Gluster 3.7.3, and you will find my setup and log messages at the
>> bottom of the email.
>>
>>
>> Any idea what could cause this and how to fix it?
>>
>> Thanks,
>> Thibault.
>>
>> ps: my setup and log messages:
>>
>>
>>
>> Master:
>>
>>
>>
>> Volume Name: home
>> Type: Replicate
>> Volume ID: 2299a204-a1dc-449d-8556-bc65197373c7
>> Status: Started
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: server4.uberit.net:/gluster/home-brick-1
>> Brick2: server5.uberit.net:/gluster/home-brick-1
>> Options Reconfigured:
>> performance.readdir-ahead: on
>> geo-replication.indexing: on
>> geo-replication.ignore-pid-check: on
>> changelog.changelog: on
>>
>>
>>
>> Slave:
>>
>> Volume Name: homegs
>> Type: Distribute
>> Volume ID: 746dfdc3-650d-4468-9fdd-d621dd215b94
>> Status: Started
>> Number of Bricks: 1
>> Transport-type: tcp
>> Bricks:
>> Brick1: remoteserver1.uberit.net:/gluster/homegs-brick-1/brick
>> Options Reconfigured:
>> performance.readdir-ahead: on
>>
>>
>>
>> The geo-replication status and config (I think I ended up with only
>> defaults values) are:
>>
>>
>>
>> # gluster volume geo-replication home ssh://remoteserver1::homegs status
>>
>> MASTER NODE       MASTER VOL    MASTER BRICK             SLAVE USER
>> SLAVE                           SLAVE NODE    STATUS    CRAWL STATUS
>> LAST_SYNCED
>>
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>> server5          home          /gluster/home-brick-1    root
>> ssh://remoteserver1::homegs      N/A           Faulty    N/A             N/A
>> server4          home          /gluster/home-brick-1    root
>> ssh://remoteserver1::homegs      N/A           Faulty    N/A             N/A
>>
>>
>>
>> # gluster volume geo-replication home ssh://remoteserver1::homegs config
>> special_sync_mode: partial
>> state_socket_unencoded:
>> /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.socket
>> gluster_log_file:
>> /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.gluster.log
>> ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no
>> -i /var/lib/glusterd/geo-replication/secret.pem
>> ignore_deletes: false
>> change_detector: changelog
>> gluster_command_dir: /usr/sbin/
>> georep_session_working_dir:
>> /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/
>> state_file:
>> /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.status
>> remote_gsyncd: /nonexistent/gsyncd
>> session_owner: 2299a204-a1dc-449d-8556-bc65197373c7
>> changelog_log_file:
>> /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-changes.log
>> socketdir: /var/run/gluster
>> working_dir:
>> /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs
>> state_detail_file:
>> /var/lib/glusterd/geo-replication/home_remoteserver1_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs-detail.status
>> ssh_command_tar: ssh -oPasswordAuthentication=no
>> -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
>> pid_file:
>> /var/lib/glusterd/geo-replication/home_remoteserver_homegs/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.pid
>> log_file:
>> /var/log/glusterfs/geo-replication/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs.log
>> gluster_params: aux-gfid-mount acl
>> volume_id: 2299a204-a1dc-449d-8556-bc65197373c7
>>
>>
>>
>> The logs look like on the master on server1:
>>
>>
>>
>> [2015-08-24 15:21:07.955600] I [monitor(monitor):221:monitor] Monitor:
>> ------------------------------------------------------------
>> [2015-08-24 15:21:07.955883] I [monitor(monitor):222:monitor] Monitor:
>> starting gsyncd worker
>> [2015-08-24 15:21:08.69528] I [gsyncd(/gluster/home-brick-1):649:main_i]
>> <top>: syncing: gluster://localhost:home ->
>> ssh://root@vivlinuxinfra1.uberit.net:gluster://localhost:homegs
>> [2015-08-24 15:21:08.70938] I [changelogagent(agent):75:__init__]
>> ChangelogAgent: Agent listining...
>> [2015-08-24 15:21:11.255237] I
>> [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up xsync
>> change detection mode
>> [2015-08-24 15:21:11.255532] I
>> [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the
>> sync engine
>> [2015-08-24 15:21:11.256570] I
>> [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up
>> changelog change detection mode
>> [2015-08-24 15:21:11.256726] I
>> [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the
>> sync engine
>> [2015-08-24 15:21:11.257345] I
>> [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up
>> changeloghistory change detection mode
>> [2015-08-24 15:21:11.257534] I
>> [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the
>> sync engine
>> [2015-08-24 15:21:13.333628] I
>> [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp
>> directory:
>> /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40172.18.0.169%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync
>> [2015-08-24 15:21:13.333870] I
>> [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time:
>> 1440426073
>> [2015-08-24 15:21:13.401132] I
>> [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with
>> volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...
>> [2015-08-24 15:21:13.412795] I
>> [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1
>> seconds
>> [2015-08-24 15:21:13.427340] I [master(/gluster/home-brick-1):1127:crawl]
>> _GMaster: starting history crawl... turns: 1, stime: (1440411353, 0)
>> [2015-08-24 15:21:14.432327] I [master(/gluster/home-brick-1):1156:crawl]
>> _GMaster: slave's time: (1440411353, 0)
>> [2015-08-24 15:21:14.890889] E
>> [repce(/gluster/home-brick-1):207:__call__] RepceClient: call
>> 20960:140215190427392:1440426074.56 (entry_ops) failed on peer with OSError
>> [2015-08-24 15:21:14.891124] E
>> [syncdutils(/gluster/home-brick-1):276:log_raise_exception] <top>: FAIL:
>> Traceback (most recent call last):
>>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in
>> main
>>     main_i()
>>   File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in
>> main_i
>>     local.service_loop(*[r for r in [remote] if r])
>>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1438,
>> in service_loop
>>     g3.crawlwrap(oneshot=True)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 584, in
>> crawlwrap
>>     self.crawl()
>>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1165,
>> in crawl
>>     self.changelogs_batch_process(changes)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1074,
>> in changelogs_batch_process
>>     self.process(batch)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 952, in
>> process
>>     self.process_change(change, done, retry)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 907, in
>> process_change
>>     failures = self.slave.server.entry_ops(entries)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in
>> __call__
>>     return self.ins(self.meth, *a)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in
>> __call__
>>     raise res
>> OSError: [Errno 5] Input/output error
>> [2015-08-24 15:21:14.892291] I
>> [syncdutils(/gluster/home-brick-1):220:finalize] <top>: exiting.
>> [2015-08-24 15:21:14.893665] I [repce(agent):92:service_loop]
>> RepceServer: terminating on reaching EOF.
>> [2015-08-24 15:21:14.893879] I [syncdutils(agent):220:finalize] <top>:
>> exiting.
>> [2015-08-24 15:21:15.259360] I [monitor(monitor):282:monitor] Monitor:
>> worker(/gluster/home-brick-1) died in startup phase
>>
>>
>>
>> and on master server2:
>>
>>
>>
>> [2015-08-24 15:21:07.650707] I [monitor(monitor):221:monitor] Monitor:
>> ------------------------------------------------------------
>> [2015-08-24 15:21:07.651144] I [monitor(monitor):222:monitor] Monitor:
>> starting gsyncd worker
>> [2015-08-24 15:21:07.764817] I [gsyncd(/gluster/home-brick-1):649:main_i]
>> <top>: syncing: gluster://localhost:home ->
>> ssh://root@remoteserver1.uberit.net:gluster://localhost:homegs
>> [2015-08-24 15:21:07.768552] I [changelogagent(agent):75:__init__]
>> ChangelogAgent: Agent listining...
>> [2015-08-24 15:21:11.9820] I
>> [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up xsync
>> change detection mode
>> [2015-08-24 15:21:11.10199] I
>> [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the
>> sync engine
>> [2015-08-24 15:21:11.10946] I
>> [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up
>> changelog change detection mode
>> [2015-08-24 15:21:11.11115] I
>> [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the
>> sync engine
>> [2015-08-24 15:21:11.11744] I
>> [master(/gluster/home-brick-1):83:gmaster_builder] <top>: setting up
>> changeloghistory change detection mode
>> [2015-08-24 15:21:11.11933] I
>> [master(/gluster/home-brick-1):404:__init__] _GMaster: using 'rsync' as the
>> sync engine
>> [2015-08-24 15:21:13.59192] I
>> [master(/gluster/home-brick-1):1212:register] _GMaster: xsync temp
>> directory:
>> /var/lib/misc/glusterfsd/home/ssh%3A%2F%2Froot%40IPADDRESS%3Agluster%3A%2F%2F127.0.0.1%3Ahomegs/62d98e8cc00a34eb85b4fe6d6fd3ba33/xsync
>> [2015-08-24 15:21:13.59454] I
>> [resource(/gluster/home-brick-1):1432:service_loop] GLUSTER: Register time:
>> 1440426073
>> [2015-08-24 15:21:13.113203] I
>> [master(/gluster/home-brick-1):523:crawlwrap] _GMaster: primary master with
>> volume id 2299a204-a1dc-449d-8556-bc65197373c7 ...
>> [2015-08-24 15:21:13.122018] I
>> [master(/gluster/home-brick-1):532:crawlwrap] _GMaster: crawl interval: 1
>> seconds
>> [2015-08-24 15:21:23.209912] E
>> [repce(/gluster/home-brick-1):207:__call__] RepceClient: call
>> 1561:140164806457088:1440426083.11 (keep_alive) failed on peer with OSError
>> [2015-08-24 15:21:23.210119] E
>> [syncdutils(/gluster/home-brick-1):276:log_raise_exception] <top>: FAIL:
>> Traceback (most recent call last):
>>   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line
>> 306, in twrap
>>     tf(*aa)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 438, in
>> keep_alive
>>     cls.slave.server.keep_alive(vi)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in
>> __call__
>>     return self.ins(self.meth, *a)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in
>> __call__
>>     raise res
>> OSError: [Errno 22] Invalid argument
>> [2015-08-24 15:21:23.210975] I
>> [syncdutils(/gluster/home-brick-1):220:finalize] <top>: exiting.
>> [2015-08-24 15:21:23.212455] I [repce(agent):92:service_loop]
>> RepceServer: terminating on reaching EOF.
>> [2015-08-24 15:21:23.212707] I [syncdutils(agent):220:finalize] <top>:
>> exiting.
>> [2015-08-24 15:21:24.23336] I [monitor(monitor):282:monitor] Monitor:
>> worker(/gluster/home-brick-1) died in startup phase
>>
>>
>>
>> and on the slave (in a different timezone, one hour behind):
>>
>>
>>
>> [2015-08-24 14:22:02.923098] I [resource(slave):844:service_loop]
>> GLUSTER: slave listening
>> [2015-08-24 14:22:07.606774] E [repce(slave):117:worker] <top>: call
>> failed:
>> Traceback (most recent call last):
>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in
>> worker
>>     res = getattr(self.obj, rmeth)(*in_data[2:])
>>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 731,
>> in entry_ops
>>     [ESTALE, EINVAL])
>>   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line
>> 475, in errno_wrap
>>     return call(*arg)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78,
>> in lsetxattr
>>     cls.raise_oserr()
>>   File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37,
>> in raise_oserr
>>     raise OSError(errn, os.strerror(errn))
>> OSError: [Errno 5] Input/output error
>> [2015-08-24 14:22:07.652092] I [repce(slave):92:service_loop]
>> RepceServer: terminating on reaching EOF.
>> [2015-08-24 14:22:07.652364] I [syncdutils(slave):220:finalize] <top>:
>> exiting.
>>
>>
>>
>> --------------
>> G-RESEARCH believes the information provided herein is reliable. While
>> every care has been taken to ensure accuracy, the information is furnished
>> to the recipients with no warranty as to the completeness and accuracy of
>> its contents and on condition that any errors or omissions shall not be
>> made the basis of any claim, demand or cause of action.
>> The information in this email is intended only for the named recipient.
>> If you are not the intended recipient please notify us immediately and do
>> not copy, distribute or take action based on this e-mail.
>> All messages sent to and from this e-mail address will be logged by
>> G-RESEARCH and are subject to archival storage, monitoring, review and
>> disclosure.
>> G-RESEARCH is the trading name of Trenchant Limited, 5th Floor,
>> Whittington House, 19-30 Alfred Place, London WC1E 7EA.
>> Trenchant Limited is a company registered in England with company number
>> 08127121.
>> --------------
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20150930/4fe42c31/attachment.html>