[Gluster-users] geo-rep: remote operation failed - No such file or directory

Wed Feb 24 15:43:50 UTC 2016

We can provide workaround steps to resync from the beginning without
deleting the volume(s).

I will send the session reset details by tomorrow.

regards
Aravinda

On 02/24/2016 09:08 PM, ML mail wrote:
> That's right, I already saw a few error messages mentioning "Device or resource busy" and was wondering what they meant...
>
> You mean I have to delete the brick on my slave node, delete the volume on my slave node and finally re-create the volume on my slave node in order to start geo-replication from the beginning again? I do not have to touch or delete anything on the master node, right?
>
>
> Regards
> ML
>
>
>
> On Wednesday, February 24, 2016 3:07 PM, Milind Changire <mchangir at redhat.com> wrote:
> ML,
> Since the fixes to geo-rep are yet to get into a release,
> I can only suggest that you be a bit patient.
> Also, since you are using logrotate to rotate logs, you
> will most likely get into the "No such file or directory"
> or "Device or resource busy" scenario on the slave again.
> I'm not saying logrotate is at fault, I'm just saying that
> that specific use case leads to an inconsistent gluster
> state.
>
> Unfortunately, you cannot selectively purge the changelogs.
> You will have to delete the volume and empty the bricks
> and recreate the volume with the empty bricks to start
> all over again.
>
> You can delete the volume with:
> # gluster volume stop <volume name>
> # gluster volume delete <volume name>
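A minimal Python sketch of the "empty the bricks" step, assuming it runs as root on Linux on the node that holds the brick, after the volume has been stopped and deleted with the commands above. Removing the `.glusterfs` directory and clearing the `trusted.glusterfs.volume-id` and `trusted.gfid` xattrs on the brick root is the commonly used recipe for reusing a directory as a brick in a new volume; that xattr list is an assumption, not something stated in this thread, and `empty_brick` is a hypothetical helper. It is destructive, so double-check the path before running it.

```python
import errno
import os
import shutil
from typing import Callable

# Xattrs commonly cleared so a directory can be reused as a brick (assumption).
BRICK_XATTRS = ("trusted.glusterfs.volume-id", "trusted.gfid")


def empty_brick(brick: str,
                removexattr: Callable[[str, str], None] = os.removexattr) -> None:
    """Hypothetical helper: delete everything inside `brick` (including the
    .glusterfs directory) and drop the volume-identity xattrs from the brick
    root. DESTRUCTIVE; run only after the volume has been stopped and deleted."""
    with os.scandir(brick) as it:
        entries = list(it)  # snapshot the listing before deleting anything
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.unlink(entry.path)  # regular files, symlinks, gfid hard links
    for name in BRICK_XATTRS:
        try:
            removexattr(brick, name)
        except OSError as exc:
            if exc.errno != errno.ENODATA:  # attribute already absent is fine
                raise
```

For example, `empty_brick("/data/myvolume-geo/brick")` would empty the slave brick seen in the log in this thread, after which the volume can be recreated on the now-empty brick. The `removexattr` parameter only exists so the function can be exercised without root privileges.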
>
> --
> Milind
>
>
> ----- Original Message -----
> From: "ML mail" <mlnospam at yahoo.com>
> To: "Milind Changire" <mchangir at redhat.com>
> Cc: "Gluster-users" <gluster-users at gluster.org>
> Sent: Wednesday, February 24, 2016 4:44:27 PM
> Subject: Re: [Gluster-users] geo-rep: remote operation failed - No such file or directory
>
> Thanks again for your help, Milind. I now understand the concept and managed to set the required attribute to force the resync. That worked, but unfortunately it is a never-ending story: I fix stuff, start geo-rep, it goes on for a few more files and fails again.
>
> Now I think it will be easier to reset geo-replication and start from scratch again; luckily my volume is only 16 GB in size, as I am still experimenting. What would be the correct way to reset geo-rep? I don't want to remove the config, but I would like to trash all the changelogs, delete all the data on the slave and restart geo-rep. How should I proceed?
>
> Regards
> ML
>
>
>
>
> On Wednesday, February 24, 2016 10:14 AM, Milind Changire <mchangir at redhat.com> wrote:
> 1. You could use the script at
>    https://gist.github.com/aravindavk/afb16813261794faa432
>     to resolve the gfid to a path that you can cd to,
>     e.g. for gfid c4b19f1c-cc18-4727-87a4-18de8fe0089e
>
> 2. Yes, you have to recursively set the virtual xattr
>     on all entries in the directory tree,
>     and remember to set a value as well:
>     # setfattr -n glusterfs.geo-rep.trigger-sync -v 1 <file-path>
>
> Also, remember to set the virtual xattr via the volume
> mount path and not the brick back-end path.
> Geo-replication should be stopped while you are setting
> the virtual xattr; start it again once you are done
> setting the xattr for the entire directory tree.
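A minimal Python sketch of the recursive step, assuming it runs against the FUSE mount of the master volume (not a brick) while geo-replication is stopped. `os.walk(topdown=True)` yields each directory before any of its subdirectories, so every parent is tagged before its children, which is the pre-order the thread asks for. The xattr name and the value `1` come from this thread; the helper `trigger_sync_tree` is hypothetical, and tagging symlink entries themselves (`follow_symlinks=False`) rather than their targets is an assumption.

```python
import os
from typing import Callable, List

XATTR_NAME = "glusterfs.geo-rep.trigger-sync"  # virtual xattr named in the thread
XATTR_VALUE = b"1"                              # "remember to set a value as well"


def default_setter(path: str) -> None:
    # Tag the entry itself; do not follow symlinks (assumption).
    os.setxattr(path, XATTR_NAME, XATTR_VALUE, follow_symlinks=False)


def trigger_sync_tree(top: str,
                      setter: Callable[[str], None] = default_setter) -> List[str]:
    """Hypothetical helper: tag `top` and everything below it, parents first.

    `top` must be a path on the volume mount, not on a brick back-end.
    Returns the paths in the order they were tagged.
    """
    visited: List[str] = []
    # topdown=True: a directory is yielded before any of its subdirectories.
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        dirnames.sort()  # deterministic traversal order
        setter(dirpath)
        visited.append(dirpath)
        # os.walk lists symlinks to directories in dirnames but does not
        # descend into them, so tag those entries here with the files.
        links = [d for d in dirnames if os.path.islink(os.path.join(dirpath, d))]
        for name in sorted(filenames + links):
            path = os.path.join(dirpath, name)
            setter(path)
            visited.append(path)
    return visited
```

Usage might look like `trigger_sync_tree("/mnt/myvolume/<path>/OC_DEFAULT_MODULE")`, where the mount point is a placeholder; restart geo-replication only after the call returns. The `setter` parameter lets the traversal be checked without a Gluster mount.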
>
> --
> Milind
>
>
> ----- Original Message -----
> From: "ML mail" <mlnospam at yahoo.com>
> To: "Milind Changire" <mchangir at redhat.com>
> Cc: "Gluster-users" <gluster-users at gluster.org>
> Sent: Wednesday, February 24, 2016 1:46:11 PM
> Subject: Re: [Gluster-users] geo-rep: remote operation failed - No such file or directory
>
> Thank you for explaining to me how the symbolic linking works in the .glusterfs directory. Now, regarding your new instructions, I have two questions:
>
> 1) How can I find out which "OC_DEFAULT_MODULE" directory on my master brick I should run the setfattr command on? My problem is that there are a lot of OC_DEFAULT_MODULE directories on my brick, not just a single one.
>
>
>
> 2) If I understand your last paragraph correctly, you want me to locate the correct OC_DEFAULT_MODULE directory and recursively run setfattr on each subdirectory and/or file inside that directory, is this correct?
>
> Regards
> ML
>
>
>
> On Wednesday, February 24, 2016 7:29 AM, Milind Changire <mchangir at redhat.com> wrote:
> ML,
> You just need to worry about the very first entry that you found with
> the find command:
>
> $ find .glusterfs -name c4b19f1c-cc18-4727-87a4-18de8fe0089e -ls
> 228215    0 lrwxrwxrwx   1 root     root           66 Feb 19 08:52 .glusterfs/c4/b1/c4b19f1c-cc18-4727-87a4-18de8fe0089e -> ../../92/1b/921bfe8e-81ef-4579-b335-abfa2c7e6afb/OC_DEFAULT_MODULE
>
> Since the back-end entry is a symlink, it means that OC_DEFAULT_MODULE
> is a directory on the master and it is missing on the slave.
> If you recursively look up the parent gfid of each entry,
> it will always point to a symlink, since a directory is always
> represented as a symlink at the glusterfs back-end, and you will
> follow the chain up to the ROOT gfid.
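The symlink chain described above can be followed programmatically. A minimal Python sketch, run against a brick back-end path on the master, assuming that the root gfid is `00000000-0000-0000-0000-000000000001` and that every directory's gfid entry is a symlink of the form `../../<aa>/<bb>/<parent-gfid>/<name>`, as in the `find` output in this thread. It only handles directory gfids, since regular files are not symlinks at the back-end; both function names are hypothetical.

```python
import os
from typing import List

ROOT_GFID = "00000000-0000-0000-0000-000000000001"  # assumed root gfid


def gfid_backend_path(brick: str, gfid: str) -> str:
    """Back-end entry for a gfid: <brick>/.glusterfs/<aa>/<bb>/<gfid>."""
    return os.path.join(brick, ".glusterfs", gfid[0:2], gfid[2:4], gfid)


def dir_gfid_to_path(brick: str, gfid: str, max_depth: int = 4096) -> str:
    """Hypothetical helper: resolve a *directory* gfid to a brick-relative path.

    Each hop reads a symlink like ../../92/1b/<parent-gfid>/<dirname>,
    records <dirname>, and continues with <parent-gfid> until the root.
    """
    names: List[str] = []
    current = gfid
    for _ in range(max_depth):
        if current == ROOT_GFID:
            # names were collected leaf-first; reverse them for the path
            return os.path.join(*reversed(names)) if names else "."
        target = os.readlink(gfid_backend_path(brick, current))
        parent_gfid, name = target.split("/")[-2:]
        names.append(name)
        current = parent_gfid
    raise RuntimeError("gfid chain too deep; possible symlink loop")
```

Applied to the chain quoted in this thread, this would walk from `c4b19f1c-...` through `921bfe8e-...`, `d79f2ebd-...` and `fdea1fc6-...` and return a path ending in `.../Banana/1602/160201_File_1602_XX.xls/OC_DEFAULT_MODULE`, which identifies the one OC_DEFAULT_MODULE directory among many.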
>
> -----
>
> Now, to get the OC_DEFAULT_MODULE directory replicated on the slave,
> you will have to set the virtual xattr on the entire directory tree
> in pre-order, i.e. set the virtual xattr on the OC_DEFAULT_MODULE
> directory first, then on the entries inside it, and so on down the
> directory tree.
>
> --
> Milind
>
>
> ----- Original Message -----
> From: "ML mail" <mlnospam at yahoo.com>
> To: "Milind Changire" <mchangir at redhat.com>
> Cc: "Gluster-users" <gluster-users at gluster.org>
> Sent: Wednesday, February 24, 2016 12:25:26 AM
> Subject: Re: [Gluster-users] geo-rep: remote operation failed - No such file or directory
>
> Hi Milind,
>
> Thanks for the instructions for forcing the data sync of a specific file. I was not able to do that, as I discovered something even weirder while trying to find the affected file by GFID with the find command, as you suggested. It looks like I have a symbolic link pointing to another one, which points to another, and so on, as you can see below:
>
> $ find .glusterfs -name c4b19f1c-cc18-4727-87a4-18de8fe0089e -ls
> 228215    0 lrwxrwxrwx   1 root     root           66 Feb 19 08:52 .glusterfs/c4/b1/c4b19f1c-cc18-4727-87a4-18de8fe0089e -> ../../92/1b/921bfe8e-81ef-4579-b335-abfa2c7e6afb/OC_DEFAULT_MODULE
>
> $ ls -la 92/1b/921bfe8e-81ef-4579-b335-abfa2c7e6afb
> lrwxrwxrwx 1 root root 79 Feb 19 08:52 92/1b/921bfe8e-81ef-4579-b335-abfa2c7e6afb -> ../../d7/9f/d79f2ebd-029c-4ac5-8074-5eef7ff21236/160201_File_1602_XX.xls
>
>
> $ ls -la d7/9f/d79f2ebd-029c-4ac5-8074-5eef7ff21236
> lrwxrwxrwx 1 root root 53 Feb 15 07:34 d7/9f/d79f2ebd-029c-4ac5-8074-5eef7ff21236 -> ../../fd/ea/fdea1fc6-0f2a-43d2-8776-651cc6ea73e8/1602
>
>
> $ ls -la fd/ea/fdea1fc6-0f2a-43d2-8776-651cc6ea73e8
> lrwxrwxrwx 1 root root 55 Feb 15 07:29 fd/ea/fdea1fc6-0f2a-43d2-8776-651cc6ea73e8 -> ../../20/25/20253364-add8-4149-a7cf-cf46d237a45c/Banana
>
>
> Is this normal? I don't understand this weird structure of never-ending symbolic links... or am I missing something?
>
>
> Regards
> ML
>
>
>
> On Tuesday, February 23, 2016 6:31 AM, Milind Changire <mchangir at redhat.com> wrote:
> ML,
> You will have to search for the gfid c4b19f1c-cc18-4727-87a4-18de8fe0089e
> on the master cluster brick back-ends and run the following command for
> that specific file on the master cluster to force a data sync [1]:
>
> # setfattr -n glusterfs.geo-rep.trigger-sync -v 1 <file-path>
>
> To search for the file at the brick back-end:
>
> # find /<path-to-brick>/.glusterfs -name c4b19f1c-cc18-4727-87a4-18de8fe0089e
>
> Once the path to the file is found on any of the bricks, you can then use
> the setfattr command described above.
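Instead of running `find` over the whole `.glusterfs` tree, the entry can be located directly, because its back-end path is derived from the gfid itself (`.glusterfs/<first two hex chars>/<next two>/<gfid>`, as seen in the `find` output in this thread). A minimal Python sketch, assuming a list of master brick paths; `locate_gfid` is a hypothetical helper. A symlink means the gfid belongs to a directory; anything else is assumed to be a regular file, whose gfid entry is a hard link to the data file.

```python
import os
from typing import Iterable, Optional, Tuple


def locate_gfid(bricks: Iterable[str], gfid: str) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: return (backend_path, kind) for the first brick
    that holds `gfid`, or None if no brick has it.

    kind is "directory" when the back-end entry is a symlink, else "file".
    """
    gfid = gfid.lower()
    for brick in bricks:
        path = os.path.join(brick, ".glusterfs", gfid[0:2], gfid[2:4], gfid)
        # Check islink first: a directory's symlink target is relative and
        # may not resolve from here, so exists() alone could miss it.
        if os.path.islink(path):
            return path, "directory"
        if os.path.exists(path):
            return path, "file"
    return None
```

With the two replica bricks described in this thread, something like `locate_gfid(["/path/to/brick1", "/path/to/brick2"], "c4b19f1c-cc18-4727-87a4-18de8fe0089e")` (brick paths are placeholders) would report a directory, matching the symlink seen in the `find` output.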
>
> Reference:
> [1] feature/changelog: Virtual xattr to trigger explicit sync in geo-rep
>      http://review.gluster.org/#/c/9337/
> --
> Milind
>
>
> ----- Original Message -----
> From: "ML mail" <mlnospam at yahoo.com>
> To: "Milind Changire" <mchangir at redhat.com>
> Cc: "Gluster-users" <gluster-users at gluster.org>
> Sent: Monday, February 22, 2016 9:10:56 PM
> Subject: Re: [Gluster-users] geo-rep: remote operation failed - No such file or directory
>
> Hi Milind,
>
> Thanks for the suggestion. I did that for a few problematic files and it seemed to continue, but now I am stuck at the following error message on the slave:
>
> [2016-02-22 15:21:30.451133] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-myvolume-geo-client-0: remote operation failed. Path: <gfid:c4b19f1c-cc18-4727-87a4-18de8fe0089e> (c4b19f1c-cc18-4727-87a4-18de8fe0089e) [No such file or directory]
>
> As you can see, this message does not include any file or directory name, so I can't go and delete that file or directory. Any other ideas on how I may proceed here?
>
> Or would it be easier to delete the whole directory that I think is affected and restart geo-rep from there? Or will this mess things up?
>
> Regards
> ML
>
>
>
> On Monday, February 22, 2016 12:12 PM, Milind Changire <mchangir at redhat.com> wrote:
> ML,
> You could try deleting the problematic files on the slave to recover geo-replication
> from the Faulty state.
>
> However, changelogs generated in the logrotate scenario will still cause
> geo-replication to go into the Faulty state frequently if geo-replication
> fails and restarts.
>
> The patches mentioned in an earlier mail are being worked on and finalized.
> They will be available soon in a release, and will prevent geo-replication
> from going into a Faulty state.
>
> --
> Milind
>
>
> ----- Original Message -----
> From: "ML mail" <mlnospam at yahoo.com>
> To: "Milind Changire" <mchangir at redhat.com>, "Gluster-users" <gluster-users at gluster.org>
> Sent: Monday, February 22, 2016 1:27:14 PM
> Subject: Re: [Gluster-users] geo-rep: remote operation failed - No such file or directory
>
> Hi Milind,
>
> Any news on this issue? I was wondering how I can fix and restart my geo-replication. Can I simply delete the problematic file(s) on my slave and restart geo-rep?
>
> Regards
> ML
>
>
>
>
>
> On Wednesday, February 17, 2016 4:30 PM, ML mail <mlnospam at yahoo.com> wrote:
>
>
> Hi Milind,
>
> Thank you for your short analysis. Indeed, that's exactly what happens: as soon as I restart geo-rep, it replays the same operations over and over because they never succeed.
>
>
> Now, regarding the sequence of file management operations: I am not totally sure how it works, but I can tell you that we are using ownCloud v8.2.2 (www.owncloud.org) with GlusterFS as the storage for this cloud software. So it is very likely that ownCloud works like this: when a user uploads a new file, it first creates it under a temporary name, which it then renames or moves after a successful upload.
>
>
> I have the feeling this issue is related to the initial issue I reported earlier this month:
> https://www.gluster.org/pipermail/gluster-users/2016-February/025176.html
>
> For now, my question is: how do I restart geo-replication successfully?
>
> Regards
> ML
>
>
>
> On Wednesday, February 17, 2016 4:10 PM, Milind Changire <mchangir at redhat.com> wrote:
>
>
> As per the slave logs, there is an attempt to RENAME files,
> i.e. a .part file getting renamed to a name without the
> .part suffix.
>
> Just restarting geo-rep isn't going to help much if
> you've already hit the problem. Since the last CHANGELOG
> is replayed by geo-rep on a restart, you'll most probably
> encounter the same log messages in the logs.
>
> Are the .part files often CREATEd, RENAMEd and DELETEd with the
> same name? Do the operations on the geo-replication master
> cluster happen in roughly the following sequence?
>
> CREATE f1.part
> RENAME f1.part f1
> DELETE f1
> CREATE f1.part
> RENAME f1.part f1
> ...
> ...
>
>
> If not, then it would help if you could send the sequence
> of file management operations.
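Why a restart keeps hitting the same error can be illustrated with a toy model in Python. This is not geo-rep's actual algorithm (which works on gfids and has its own handling); it only shows that a RENAME is not idempotent: once `f1.part -> f1` has been applied on the slave, replaying the same changelog finds no `f1.part` and fails. The real log in this thread shows the analogous pattern, a lookup failing with "No such file or directory" followed by the rename returning "Device or resource busy". The `replay` function and the set-based namespace are hypothetical.

```python
from typing import List, Set, Tuple

# A toy changelog entry: ("CREATE", name), ("RENAME", src, dst) or ("DELETE", name)
Op = Tuple[str, ...]


def replay(slave: Set[str], ops: List[Op]) -> List[str]:
    """Apply ops to a toy slave namespace in place; return one message per failed op."""
    errors: List[str] = []
    for op in ops:
        kind = op[0]
        if kind == "CREATE":
            slave.add(op[1])
        elif kind == "RENAME":
            src, dst = op[1], op[2]
            if src not in slave:
                # The source was already renamed by an earlier, successful pass.
                errors.append(f"RENAME {src} -> {dst}: No such file or directory")
                continue
            slave.remove(src)
            slave.add(dst)
        elif kind == "DELETE":
            slave.discard(op[1])
        else:
            raise ValueError(f"unknown op {kind!r}")
    return errors


slave: Set[str] = set()
changelog_1: List[Op] = [("CREATE", "f1.part")]
changelog_2: List[Op] = [("RENAME", "f1.part", "f1")]
print(replay(slave, changelog_1))  # []
print(replay(slave, changelog_2))  # []
print(replay(slave, changelog_2))  # ['RENAME f1.part -> f1: No such file or directory']
```

The third call stands in for a restart that replays the last changelog: the slave already holds `f1`, so the same failure repeats on every restart until the slave state or the changelog is dealt with.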
>
> --
> Milind
>
>
> ----- Original Message -----
> From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>
> To: "ML mail" <mlnospam at yahoo.com>
> Cc: "Gluster-users" <gluster-users at gluster.org>, "Milind Changire" <mchangir at redhat.com>
> Sent: Tuesday, February 16, 2016 6:28:21 PM
> Subject: Re: [Gluster-users] geo-rep: remote operation failed - No such file or directory
>
> CCing Milind; he should be able to help.
>
> Thanks and Regards,
> Kotresh H R
>
> ----- Original Message -----
>> From: "ML mail" <mlnospam at yahoo.com>
>> To: "Gluster-users" <gluster-users at gluster.org>
>> Sent: Monday, February 15, 2016 4:41:56 PM
>> Subject: [Gluster-users] geo-rep: remote operation failed - No such file or directory
>>
>> Hello,
>>
>> I noticed that the geo-replication of a volume has STATUS "Faulty" and while
>> looking in the *.gluster.log file in
>> /var/log/glusterfs/geo-replication-slaves/ on my slave I can see the
>> following relevant problem:
>>
>> [2016-02-15 10:58:40.402516] I [rpc-clnt.c:1847:rpc_clnt_reconfig]
>> 0-myvolume-geo-client-0: changing port to 49152 (from 0)
>> [2016-02-15 10:58:40.403928] I [MSGID: 114057]
>> [client-handshake.c:1437:select_server_supported_programs]
>> 0-myvolume-geo-client-0: Using Program GlusterFS 3.3, Num (1298437), Version
>> (330)
>> [2016-02-15 10:58:40.404130] I [MSGID: 114046]
>> [client-handshake.c:1213:client_setvolume_cbk] 0-myvolume-geo-client-0:
>> Connected to myvolume-geo-client-0, attached to remote volume
>> '/data/myvolume-geo/brick'.
>> [2016-02-15 10:58:40.404150] I [MSGID: 114047]
>> [client-handshake.c:1224:client_setvolume_cbk] 0-myvolume-geo-client-0:
>> Server and Client lk-version numbers are not same, reopening the fds
>> [2016-02-15 10:58:40.410150] I [fuse-bridge.c:5137:fuse_graph_setup] 0-fuse:
>> switched to graph 0
>> [2016-02-15 10:58:40.410223] I [MSGID: 114035]
>> [client-handshake.c:193:client_set_lk_version_cbk] 0-myvolume-geo-client-0:
>> Server lk version = 1
>> [2016-02-15 10:58:40.410370] I [fuse-bridge.c:4030:fuse_init]
>> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22 kernel
>> 7.23
>> [2016-02-15 10:58:45.662416] I [MSGID: 109066] [dht-rename.c:1411:dht_rename]
>> 0-myvolume-geo-dht: renaming
>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_03_Rosen.JPG-chunking-2242590604-0.FpKL3SIUb9vKHyjd.part
>> (hash=myvolume-geo-client-0/cache=myvolume-geo-client-0) =>
>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_03_Rosen.JPG-chunking-2242590604-0
>> (hash=myvolume-geo-client-0/cache=<nul>)
>> [2016-02-15 10:58:45.665144] I [MSGID: 109066] [dht-rename.c:1411:dht_rename]
>> 0-myvolume-geo-dht: renaming
>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_03_Rosen.JPG-chunking-2242590604-1.C6l0DEurb2y3Azw4.part
>> (hash=myvolume-geo-client-0/cache=myvolume-geo-client-0) =>
>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_03_Rosen.JPG-chunking-2242590604-1
>> (hash=myvolume-geo-client-0/cache=<nul>)
>> [2016-02-15 10:58:45.749829] I [MSGID: 109066] [dht-rename.c:1411:dht_rename]
>> 0-myvolume-geo-dht: renaming
>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0.ajEnSguUZ7EkzjzT.part
>> (hash=myvolume-geo-client-0/cache=myvolume-geo-client-0) =>
>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0
>> (hash=myvolume-geo-client-0/cache=<nul>)
>> [2016-02-15 10:58:45.750225] W [MSGID: 114031]
>> [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-myvolume-geo-client-0:
>> remote operation failed. Path:
>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0.ajEnSguUZ7EkzjzT.part
>> (9164caeb-740d-4429-a3bd-c85f40c35e11) [No such file or directory]
>> [2016-02-15 10:58:45.750418] W [fuse-bridge.c:1777:fuse_rename_cbk]
>> 0-glusterfs-fuse: 60:
>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0.ajEnSguUZ7EkzjzT.part
>> ->
>> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0
>> => -1 (Device or resource busy)
>> [2016-02-15 10:58:45.767788] I [fuse-bridge.c:4984:fuse_thread_proc] 0-fuse:
>> unmounting /tmp/gsyncd-aux-mount-bZ9SMt
>> [2016-02-15 10:58:45.768063] W [glusterfsd.c:1236:cleanup_and_exit]
>> (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4) [0x7feb610820a4]
>> -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7feb626f45b5]
>> -->/usr/sbin/glusterfs(cleanup_and_exit+0x59) [0x7feb626f4429] ) 0-:
>> received signum (15), shutting down
>> [2016-02-15 10:58:45.768093] I [fuse-bridge.c:5683:fini] 0-fuse: Unmounting
>> '/tmp/gsyncd-aux-mount-bZ9SMt'.
>> [2016-02-15 10:58:54.871855] I [dict.c:473:dict_get]
>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/xlator/system/posix-acl.so(posix_acl_setxattr_cbk+0x26)
>> [0x7f8313dfb166]
>> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/xlator/system/posix-acl.so(handling_other_acl_related_xattr+0x20)
>> [0x7f8313dfb060]
>> -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get+0x93)
>> [0x7f831f3f40c3] ) 0-dict: !this || key=system.posix_acl_access [Invalid
>> argument]
>> [2016-02-15 10:58:54.871914] I [dict.c:473:dict_get]
>> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/xlator/system/posix-acl.so(posix_acl_setxattr_cbk+0x26)
>> [0x7f8313dfb166]
>> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/xlator/system/posix-acl.so(handling_other_acl_related_xattr+0xb0)
>> [0x7f8313dfb0f0]
>> -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get+0x93)
>> [0x7f831f3f40c3] ) 0-dict: !this || key=system.posix_acl_default [Invalid
>> argument]
>>
>> This error repeats forever, always with the same files. I tried to stop
>> and restart geo-rep on the master, but the problem persists and
>> geo-replication does not proceed. Does anyone have an idea how to fix this?
>>
>> I am using GlusterFS 3.7.6 on Debian 8 with a two node replicate volume (1
>> brick per node) and one single off-site slave node for geo-rep.
>>
>> Regards
>> ML
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>