[Gluster-users] geo-rep: remote operation failed - No such file or directory

Aravinda avishwan at redhat.com
Thu Feb 25 10:57:05 UTC 2016


Steps to force Geo-rep to sync from beginning

1. Stop Geo-replication
2. Disable the Changelog using `gluster volume set <MASTER VOLNAME> 
changelog.changelog off`
3. Delete all changelogs and htime files from the Brick backend of the 
Master Volume: $BRICK/.glusterfs/changelogs
4. Remove the stime xattr from the root of every Brick of the Master 
Volume (setfattr -x trusted.glusterfs.<MASTERVOL_ID>.<SLAVEVOL_ID>.stime)
     Where each ID comes from `gluster volume info <VOLNAME> | grep ID`
5. Enable the changelog again: `gluster volume set <MASTER VOLNAME> 
changelog.changelog on`
6. Clean the Slave Volume by deleting all of its content.
7. Start Geo-replication as usual. This will start syncing like a fresh 
setup.

You can try this re-setup with the 3.7.9 release (which is expected any 
time before mid March). We are trying to merge the Geo-rep patches 
related to this issue into glusterfs-3.7.9.

Geo-rep should clean up these xattrs when the session is deleted; we 
will work on that fix in a future release.
BUG: https://bugzilla.redhat.com/show_bug.cgi?id=1311926

regards
Aravinda

On 02/24/2016 09:59 PM, ML mail wrote:
> That would be great, thank you. For me it is not an option to delete the volume on my master node (2 nodes, 1 brick per node). On the other hand, there is no problem deleting the volume on the slave node, which is only used for geo-rep.
>
> Regards
> ML
>
>
>
> On Wednesday, February 24, 2016 4:44 PM, Aravinda <avishwan at redhat.com> wrote:
> We can provide workaround steps to resync from beginning without
> deleting Volume(s).
>
> I will send the Session reset details by tomorrow.
>
> regards
> Aravinda
>
> On 02/24/2016 09:08 PM, ML mail wrote:
>> That's right, I had already seen a few error messages mentioning "Device or resource busy" and was wondering what they were...
>>
>> You mean I have to delete the brick on my slave node, delete the volume on my slave node and finally re-create the volume on my slave node in order to start geo-replication from the beginning again? I do not have to touch or delete anything on the master node, right?
>>
>>
>> Regards
>> ML
>>
>>
>>
>> On Wednesday, February 24, 2016 3:07 PM, Milind Changire <mchangir at redhat.com> wrote:
>> ML,
>> Since the fixes to geo-rep are yet to get into a release,
>> I can only suggest that you be a bit patient.
>> Also, since you are using logrotate to rotate logs, you
>> will most likely get into the "No such file or directory"
>> or "Device or resource busy" scenario on the slave again.
>> I'm not saying logrotate is at fault, I'm just saying that
>> that specific use case leads to an inconsistent gluster
>> state.
>>
>> Unfortunately, you cannot selectively purge the changelogs.
>> You will have to delete the volume and empty the bricks
>> and recreate the volume with the empty bricks to start
>> all over again.
>>
>> You can delete the volume with:
>> # gluster volume stop <volume name>
>> # gluster volume delete <volume name>
>>
>> --
>> Milind
>>
>>
>> ----- Original Message -----
>> From: "ML mail" <mlnospam at yahoo.com>
>> To: "Milind Changire" <mchangir at redhat.com>
>> Cc: "Gluster-users" <gluster-users at gluster.org>
>> Sent: Wednesday, February 24, 2016 4:44:27 PM
>> Subject: Re: [Gluster-users] geo-rep: remote operation failed - No such file or    directory
>>
>> Thanks again, Milind, for your help. I now understand the concept and managed to set the required attribute to force the resync. That worked, but unfortunately it is a never-ending story: I fix something, start geo-rep, it syncs a few more files, and then fails again.
>>
>> Now I think it will be easier to reset geo-replication and start from scratch; luckily my volume is only 16 GB, as I am still experimenting. What would be the correct way to reset geo-rep? I don't want to remove the config, but I would like to discard all the changelogs, delete all the data on the slave, and restart geo-rep. How should I proceed?
>>
>> Regards
>> ML
>>
>>
>>
>>
>> On Wednesday, February 24, 2016 10:14 AM, Milind Changire <mchangir at redhat.com> wrote:
>> 1. You could use the script at
>>     https://gist.github.com/aravindavk/afb16813261794faa432
>>     to create a path from the gfid that you could cd to,
>>     i.e. for gfid c4b19f1c-cc18-4727-87a4-18de8fe0089e
>>
>> 2. yes, you have to recursively set the virtual xattr
>>      on all entries in the directory tree
>>      Also, remember to set a value as well
>>      # setfattr -n glusterfs.geo-rep.trigger-sync -v 1 <file-path>
>>
>> Also, remember to set the virtual xattr via the volume
>> mount path and not the brick back-end path.
>> You should have geo-replication stopped when you are
>> setting the virtual xattr and start it when you are
>> done setting the xattr for the entire directory tree.
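>> That recursive pass could be sketched as follows (the mount path below
>> is only an example, geo-replication is assumed to be stopped first, and
>> the walk is pre-order so parents are marked before their children):

```shell
MOUNT=/mnt/mastervol    # example: FUSE mount path of the master volume
# find prints each directory before its contents (pre-order),
# so the xattr is set on a directory before the entries inside it.
find "$MOUNT/path/to/OC_DEFAULT_MODULE" | while read -r p; do
    setfattr -n glusterfs.geo-rep.trigger-sync -v 1 "$p"
done
```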
>>
>> --
>> Milind
>>
>>
>> ----- Original Message -----
>> From: "ML mail" <mlnospam at yahoo.com>
>> To: "Milind Changire" <mchangir at redhat.com>
>> Cc: "Gluster-users" <gluster-users at gluster.org>
>> Sent: Wednesday, February 24, 2016 1:46:11 PM
>> Subject: Re: [Gluster-users] geo-rep: remote operation failed - No such file or    directory
>>
>> Thank you for explaining to me how the symbolic linking works in the .glusterfs directory. Now, regarding your new instructions, I have two questions:
>>
>> 1) How can I find out which directory "OC_DEFAULT_MODULE" on my master brick I should run the
>> setfattr command on? My problem here is that there are a lot of OC_DEFAULT_MODULE directories on my brick not just only a single one.
>>
>>
>>
>> 2) If I understand your last paragraph correctly, you want me to locate the correct OC_DEFAULT_MODULE directory and recursively use setfattr on each sub-directories and/or files inside that directory, is this correct?
>>
>> Regards
>> ML
>>
>>
>>
>> On Wednesday, February 24, 2016 7:29 AM, Milind Changire <mchangir at redhat.com> wrote:
>> ML,
>> You just need to worry about the very first entry that you found with
>> the find command:
>>
>> $ find .glusterfs -name c4b19f1c-cc18-4727-87a4-18de8fe0089e -ls
>> 228215    0 lrwxrwxrwx   1 root     root           66 Feb 19 08:52 .glusterfs/c4/b1/c4b19f1c-cc18-4727-87a4-18de8fe0089e -> ../../92/1b/921bfe8e-81ef-4579-b335-abfa2c7e6afb/OC_DEFAULT_MODULE
>>
>> Since the back-end entry is a symlink, it means that OC_DEFAULT_MODULE
>> is a directory on the master and it is missing on the slave.
>> If you try to recursively look at the parent gfids of each of the entries
>> then they will always point to symlinks since a directory is always
>> represented as a symlink at the glusterfs back-end, and you will follow
>> them up to the ROOT gfid.
>>
>> -----
>>
>> Now, to get the OC_DEFAULT_MODULE directory replicated on the slave,
>> you will have to set the virtual xattr on the entire directory tree
>> in pre-order listing i.e. set the virtual xattr on the directory
>> starting at OC_DEFAULT_MODULE and then on the entries inside the
>> directory, and so on down the directory tree.
>>
>> --
>> Milind
>>
>>
>> ----- Original Message -----
>> From: "ML mail" <mlnospam at yahoo.com>
>> To: "Milind Changire" <mchangir at redhat.com>
>> Cc: "Gluster-users" <gluster-users at gluster.org>
>> Sent: Wednesday, February 24, 2016 12:25:26 AM
>> Subject: Re: [Gluster-users] geo-rep: remote operation failed - No such file or    directory
>>
>> Hi Milind,
>>
>> Thanks for the instructions for forcing the data sync of a specific file. I was not able to do that, as I discovered something even weirder while trying to locate the file in question by GFID with the find command, as you suggested. Indeed, it looks like I have a symbolic link pointing to another one, and then to another, and so on, as you can see below:
>>
>> $ find .glusterfs -name c4b19f1c-cc18-4727-87a4-18de8fe0089e -ls
>> 228215    0 lrwxrwxrwx   1 root     root           66 Feb 19 08:52 .glusterfs/c4/b1/c4b19f1c-cc18-4727-87a4-18de8fe0089e -> ../../92/1b/921bfe8e-81ef-4579-b335-abfa2c7e6afb/OC_DEFAULT_MODULE
>>
>> $ ls -la 92/1b/921bfe8e-81ef-4579-b335-abfa2c7e6afb
>> lrwxrwxrwx 1 root root 79 Feb 19 08:52 92/1b/921bfe8e-81ef-4579-b335-abfa2c7e6afb -> ../../d7/9f/d79f2ebd-029c-4ac5-8074-5eef7ff21236/160201_File_1602_XX.xls
>>
>>
>> $ ls -la d7/9f/d79f2ebd-029c-4ac5-8074-5eef7ff21236
>> lrwxrwxrwx 1 root root 53 Feb 15 07:34 d7/9f/d79f2ebd-029c-4ac5-8074-5eef7ff21236 -> ../../fd/ea/fdea1fc6-0f2a-43d2-8776-651cc6ea73e8/1602
>>
>>
>> $ ls -la fd/ea/fdea1fc6-0f2a-43d2-8776-651cc6ea73e8
>> lrwxrwxrwx 1 root root 55 Feb 15 07:29 fd/ea/fdea1fc6-0f2a-43d2-8776-651cc6ea73e8 -> ../../20/25/20253364-add8-4149-a7cf-cf46d237a45c/Banana
>>
>>
>> Is this normal? I somehow don't understand this weird structure of never-ending symbolic links... or am I missing something?
>>
>>
>> Regards
>> ML
>>
>>
>>
>> On Tuesday, February 23, 2016 6:31 AM, Milind Changire <mchangir at redhat.com> wrote:
>> ML,
>> You will have to search for the gfid c4b19f1c-cc18-4727-87a4-18de8fe0089e
>> at the master cluster brick back-ends and run the following command for
>> that specific file on the master cluster to force triggering a data sync [1]
>>
>> # setfattr -n glusterfs.geo-rep.trigger-sync <file-path>
>>
>> To search for the file at the brick back-end:
>>
>> # find /<path-to-brick>/.glusterfs -name c4b19f1c-cc18-4727-87a4-18de8fe0089e
>>
>> Once path to the file is found at any of the bricks, you can then use
>> the setfattr command described above.
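>> The two commands can be combined into one sketch (the brick path below
>> is a placeholder, the gfid is the one from this thread, and note that
>> the trigger-sync xattr should ultimately be set through the volume
>> mount path, with an explicit value):

```shell
GFID=c4b19f1c-cc18-4727-87a4-18de8fe0089e
BRICK=/data/myvolume/brick    # placeholder: brick path on a master node
# Locate the gfid entry at the brick back-end
find "$BRICK/.glusterfs" -name "$GFID"
# ...then set the virtual xattr on the corresponding path under the
# volume mount (not the brick path), for example:
# setfattr -n glusterfs.geo-rep.trigger-sync -v 1 <mount-path-to-file>
```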
>>
>> Reference:
>> [1] feature/changelog: Virtual xattr to trigger explicit sync in geo-rep
>>       http://review.gluster.org/#/c/9337/
>> --
>> Milind
>>
>>
>> ----- Original Message -----
>> From: "ML mail" <mlnospam at yahoo.com>
>> To: "Milind Changire" <mchangir at redhat.com>
>> Cc: "Gluster-users" <gluster-users at gluster.org>
>> Sent: Monday, February 22, 2016 9:10:56 PM
>> Subject: Re: [Gluster-users] geo-rep: remote operation failed - No such file or    directory
>>
>> Hi Milind,
>>
>> Thanks for the suggestion. I did that for a few problematic files and syncing seemed to continue, but now I am stuck at the following error message on the slave:
>>
>> [2016-02-22 15:21:30.451133] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-myvolume-geo-client-0: remote operation failed. Path: <gfid:c4b19f1c-cc18-4727-87a4-18de8fe0089e> (c4b19f1c-cc18-4727-87a4-18de8fe0089e) [No such file or directory]
>>
>> As you can see, this message does not include any file or directory name, so I can't go and delete that file or directory. Any other ideas on how I may proceed here?
>>
>> Or maybe would it be easier if I delete the whole directory which I think is affected and start geo-rep from there? Or will this mess things up?
>>
>> Regards
>> ML
>>
>>
>>
>> On Monday, February 22, 2016 12:12 PM, Milind Changire <mchangir at redhat.com> wrote:
>> ML,
>> You could try deleting problematic files on slave to recover geo-replication
>> from Faulty state.
>>
>> However, changelogs generated due to logrotate scenario will still cause
>> geo-replication to go into Faulty state frequently if geo-replication
>> fails and restarts.
>>
>> The patches mentioned in an earlier mail are being worked upon and finalized.
>> They will be available soon in a release which will avoid geo-replication
>> going into a Faulty state.
>>
>> --
>> Milind
>>
>>
>> ----- Original Message -----
>> From: "ML mail" <mlnospam at yahoo.com>
>> To: "Milind Changire" <mchangir at redhat.com>, "Gluster-users" <gluster-users at gluster.org>
>> Sent: Monday, February 22, 2016 1:27:14 PM
>> Subject: Re: [Gluster-users] geo-rep: remote operation failed - No such file or    directory
>>
>> Hi Milind,
>>
>> Any news on this issue? I was wondering how can I fix and restart my geo-replication? Can I simply delete the problematic file(s) on my slave and restart geo-rep?
>>
>> Regards
>> ML
>>
>>
>>
>>
>>
>> On Wednesday, February 17, 2016 4:30 PM, ML mail <mlnospam at yahoo.com> wrote:
>>
>>
>> Hi Milind,
>>
>> Thank you for your short analysis. Indeed, that's exactly what happens: as soon as I restart geo-rep, it replays the same operations over and over, as it does not succeed.
>>
>>
>> Now, regarding the sequence of file management operations, I am not totally sure how it works, but I can tell you that we are using ownCloud v8.2.2 (www.owncloud.org), with GlusterFS as the storage for this cloud software. So it is very probable that ownCloud works like this: when a user uploads a new file, it first creates it with a temporary name, which it then either renames or moves after a successful upload.
>>
>>
>> I have the feeling this issue is related to my initial issue which I have reported earlier this month:
>> https://www.gluster.org/pipermail/gluster-users/2016-February/025176.html
>>
>> For now my question is: how do I restart geo-replication successfully?
>>
>> Regards
>> ML
>>
>>
>>
>> On Wednesday, February 17, 2016 4:10 PM, Milind Changire <mchangir at redhat.com> wrote:
>>
>>
>> As per the slave logs, there is an attempt to RENAME files
>> i.e. a .part file getting renamed to a name without the
>> .part suffix
>>
>> Just restarting geo-rep isn't going to help much if
>> you've already hit the problem. Since the last CHANGELOG
>> is replayed by geo-rep on a restart, you'll most probably
>> encounter the same log messages in the logs.
>>
>> Are the .part files CREATEd, RENAMEd and DELETEd with the
>> same name often? Are the operations somewhat in the following
>> sequence that happen on the geo-replication master cluster?
>>
>> CREATE f1.part
>> RENAME f1.part f1
>> DELETE f1
>> CREATE f1.part
>> RENAME f1.part f1
>> ...
>> ...
>>
>>
>> If not, then it would help if you could send the sequence
>> of file management operations.
>>
>> --
>> Milind
>>
>>
>> ----- Original Message -----
>> From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>
>> To: "ML mail" <mlnospam at yahoo.com>
>> Cc: "Gluster-users" <gluster-users at gluster.org>, "Milind Changire" <mchangir at redhat.com>
>> Sent: Tuesday, February 16, 2016 6:28:21 PM
>> Subject: Re: [Gluster-users] geo-rep: remote operation failed - No such file or    directory
>>
>> Ccing Milind, he would be able to help
>>
>> Thanks and Regards,
>> Kotresh H R
>>
>> ----- Original Message -----
>>> From: "ML mail" <mlnospam at yahoo.com>
>>> To: "Gluster-users" <gluster-users at gluster.org>
>>> Sent: Monday, February 15, 2016 4:41:56 PM
>>> Subject: [Gluster-users] geo-rep: remote operation failed - No such file or    directory
>>>
>>> Hello,
>>>
>>> I noticed that the geo-replication of a volume has STATUS "Faulty" and while
>>> looking in the *.gluster.log file in
>>> /var/log/glusterfs/geo-replication-slaves/ on my slave I can see the
>>> following relevant problem:
>>>
>>> [2016-02-15 10:58:40.402516] I [rpc-clnt.c:1847:rpc_clnt_reconfig]
>>> 0-myvolume-geo-client-0: changing port to 49152 (from 0)
>>> [2016-02-15 10:58:40.403928] I [MSGID: 114057]
>>> [client-handshake.c:1437:select_server_supported_programs]
>>> 0-myvolume-geo-client-0: Using Program GlusterFS 3.3, Num (1298437), Version
>>> (330)
>>> [2016-02-15 10:58:40.404130] I [MSGID: 114046]
>>> [client-handshake.c:1213:client_setvolume_cbk] 0-myvolume-geo-client-0:
>>> Connected to myvolume-geo-client-0, attached to remote volume
>>> '/data/myvolume-geo/brick'.
>>> [2016-02-15 10:58:40.404150] I [MSGID: 114047]
>>> [client-handshake.c:1224:client_setvolume_cbk] 0-myvolume-geo-client-0:
>>> Server and Client lk-version numbers are not same, reopening the fds
>>> [2016-02-15 10:58:40.410150] I [fuse-bridge.c:5137:fuse_graph_setup] 0-fuse:
>>> switched to graph 0
>>> [2016-02-15 10:58:40.410223] I [MSGID: 114035]
>>> [client-handshake.c:193:client_set_lk_version_cbk] 0-myvolume-geo-client-0:
>>> Server lk version = 1
>>> [2016-02-15 10:58:40.410370] I [fuse-bridge.c:4030:fuse_init]
>>> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22 kernel
>>> 7.23
>>> [2016-02-15 10:58:45.662416] I [MSGID: 109066] [dht-rename.c:1411:dht_rename]
>>> 0-myvolume-geo-dht: renaming
>>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_03_Rosen.JPG-chunking-2242590604-0.FpKL3SIUb9vKHyjd.part
>>> (hash=myvolume-geo-client-0/cache=myvolume-geo-client-0) =>
>>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_03_Rosen.JPG-chunking-2242590604-0
>>> (hash=myvolume-geo-client-0/cache=<nul>)
>>> [2016-02-15 10:58:45.665144] I [MSGID: 109066] [dht-rename.c:1411:dht_rename]
>>> 0-myvolume-geo-dht: renaming
>>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_03_Rosen.JPG-chunking-2242590604-1.C6l0DEurb2y3Azw4.part
>>> (hash=myvolume-geo-client-0/cache=myvolume-geo-client-0) =>
>>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_03_Rosen.JPG-chunking-2242590604-1
>>> (hash=myvolume-geo-client-0/cache=<nul>)
>>> [2016-02-15 10:58:45.749829] I [MSGID: 109066] [dht-rename.c:1411:dht_rename]
>>> 0-myvolume-geo-dht: renaming
>>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0.ajEnSguUZ7EkzjzT.part
>>> (hash=myvolume-geo-client-0/cache=myvolume-geo-client-0) =>
>>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0
>>> (hash=myvolume-geo-client-0/cache=<nul>)
>>> [2016-02-15 10:58:45.750225] W [MSGID: 114031]
>>> [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-myvolume-geo-client-0:
>>> remote operation failed. Path:
>>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0.ajEnSguUZ7EkzjzT.part
>>> (9164caeb-740d-4429-a3bd-c85f40c35e11) [No such file or directory]
>>> [2016-02-15 10:58:45.750418] W [fuse-bridge.c:1777:fuse_rename_cbk]
>>> 0-glusterfs-fuse: 60:
>>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0.ajEnSguUZ7EkzjzT.part
>>> ->
>>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0
>>> => -1 (Device or resource busy)
>>> [2016-02-15 10:58:45.767788] I [fuse-bridge.c:4984:fuse_thread_proc] 0-fuse:
>>> unmounting /tmp/gsyncd-aux-mount-bZ9SMt
>>> [2016-02-15 10:58:45.768063] W [glusterfsd.c:1236:cleanup_and_exit]
>>> (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4) [0x7feb610820a4]
>>> -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7feb626f45b5]
>>> -->/usr/sbin/glusterfs(cleanup_and_exit+0x59) [0x7feb626f4429] ) 0-:
>>> received signum (15), shutting down
>>> [2016-02-15 10:58:45.768093] I [fuse-bridge.c:5683:fini] 0-fuse: Unmounting
>>> '/tmp/gsyncd-aux-mount-bZ9SMt'.
>>> [2016-02-15 10:58:54.871855] I [dict.c:473:dict_get]
>>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/xlator/system/posix-acl.so(posix_acl_setxattr_cbk+0x26)
>>> [0x7f8313dfb166]
>>> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/xlator/system/posix-acl.so(handling_other_acl_related_xattr+0x20)
>>> [0x7f8313dfb060]
>>> -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get+0x93)
>>> [0x7f831f3f40c3] ) 0-dict: !this || key=system.posix_acl_access [Invalid
>>> argument]
>>> [2016-02-15 10:58:54.871914] I [dict.c:473:dict_get]
>>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/xlator/system/posix-acl.so(posix_acl_setxattr_cbk+0x26)
>>> [0x7f8313dfb166]
>>> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/xlator/system/posix-acl.so(handling_other_acl_related_xattr+0xb0)
>>> [0x7f8313dfb0f0]
>>> -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get+0x93)
>>> [0x7f831f3f40c3] ) 0-dict: !this || key=system.posix_acl_default [Invalid
>>> argument]
>>>
>>> This error is repeated forever, always with the same files. I tried to stop
>>> and restart geo-rep on the master, but the problem remains and geo-replication
>>> does not proceed. Does anyone have an idea how to fix this?
>>>
>>> I am using GlusterFS 3.7.6 on Debian 8 with a two node replicate volume (1
>>> brick per node) and one single off-site slave node for geo-rep.
>>>
>>> Regards
>>> ML
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users


