[Gluster-devel] Report ESTALE as ENOENT

Wed Oct 18 14:09:39 UTC 2017

On Wed, Oct 18, 2017 at 12:04:39PM +0530, Raghavendra G wrote:
> On Wed, Oct 11, 2017 at 4:11 PM, Raghavendra G <raghavendra at gluster.com>
> wrote:
> 
> > We ran into a regression [2][3]. Hence reviving this thread.
> >
> > [2] https://bugzilla.redhat.com/show_bug.cgi?id=1500269
> > [3] https://review.gluster.org/18463
> >
> > On Thu, Mar 31, 2016 at 1:22 AM, J. Bruce Fields <bfields at fieldses.org>
> > wrote:
> >> On Mon, Mar 28, 2016 at 04:21:00PM -0400, Vijay Bellur wrote:
> >> > I would prefer to:
> >> >
> >> > 1. Return ENOENT for all system calls that operate on a path.
> >> >
> >> > 2. ESTALE might be ok for file descriptor based operations.
> >>
> >> Note that operations which operate on paths can fail with ESTALE when
> >> they attempt to look up a component within a directory that no longer
> >> exists.
> >>
> >
> > But "man 2 rmdir" and "man 2 unlink" don't list ESTALE as a valid
> > error. Also, rm doesn't seem to handle ESTALE either [4].
> >
> > [4] https://github.com/coreutils/coreutils/blob/master/src/remove.c#L305
> >
> >
> >> Maybe non-creating open("./foo") returning ENOENT would be reasonable in
> >> this case since that's what you'd get in the local filesystem case, but
> >> creat("./foo") returning ENOENT, for example, isn't something
> >> applications will be written to handle.
> >>
> >> The Linux VFS will retry ESTALE on path-based system calls *one* time, to
> >> reduce the chance of ESTALE in those cases.
> >
> >
> > I should've anticipated bug [2] due to this comment. My mistake. Bug [2]
> > is indeed due to kernel not retrying open on receiving an ENOENT error.
> > Glusterfs sent ENOENT because file's inode-number/nodeid changed but same
> > path exists. The correct error would've been ESTALE, but due to our
> > conversion of ESTALE to ENOENT, the latter was sent back to kernel.
> >
> 
> We have an application that does very frequent renames (around 10-15 per
> second), so the kernel's single retry of an open that failed with ESTALE
> is not helping us.
> 
> @Bruce/Brian,
> 
> Do you know why VFS chose an approach of retrying instead of a stricter
> synchronization mechanism using locking? For example, rename and open
> could've been synchronized using a lock.

Cc'ing Jeff, who did this work.  It was a compromise that was good
enough to solve the problems we were seeing at the time.  I forget the
details.

But:

> For example, a rough pseudocode for open and rename could've been (please
> ignore ordering src,dst locks in rename)
> 
> sys_open ()
> {
>        LOCK (dentry->lock);
>        {
>             lookup path;
>             open (inode)
>        }
>        UNLOCK (dentry->lock);
> }
> 
> sys_rename ()
> {
>          LOCK (dst-dentry->lock);
>          {
>                 LOCK (src-dentry->lock);
>                 {
>                      rename (src, dst);
>                 }
>                 UNLOCK (src-dentry->lock);
>         }
>         UNLOCK (dst-dentry->lock);
> }

We already have adequate locking on the client side.  Users of local
filesystems will never see this problem.

The problem occurs when some process on the server side removes a
directory while we're using it.  No amount of client locking will help
there.

> @Bruce,
> 
> With the current retry model, any suggestions on how to handle applications
> that do frequent renames?

I think Jeff left open the possibility that we might have to increase
the number of retries.
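
To illustrate what a higher retry count amounts to, here is a minimal
user-space sketch of the same idea: retry a path-based open() a bounded
number of times when it fails with ESTALE. The helper name
open_retry_estale and the max_retries parameter are hypothetical, and
this is not how the kernel implements its own retry, which happens
inside the VFS path lookup rather than in the application.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any library): call open() and, if it
 * fails with ESTALE, try again up to max_retries additional times.
 * Any other result (success or a different error) is returned at once,
 * with errno left as open() set it.
 */
int open_retry_estale(const char *path, int flags, mode_t mode,
                      int max_retries)
{
    int attempt = 0;

    for (;;) {
        int fd = open(path, flags, mode);

        /* Stop on success, on any non-ESTALE error, or when the
         * retry budget is exhausted. */
        if (fd >= 0 || errno != ESTALE || attempt >= max_retries)
            return fd;

        attempt++;
    }
}
```

A caller would use it like open(), for example
open_retry_estale("./foo", O_RDONLY, 0, 5). Note that this only helps if
the filesystem actually reports ESTALE to the application; with ESTALE
converted to ENOENT, as described earlier in the thread, the loop never
retries.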

An iron-clad solution would probably require protocol extensions.

I'm curious, though, what exactly is happening in your use case.  Frequent
renames aren't enough; they need to be losing races with rmdir on other
hosts.

--b.