[Gluster-devel] io-threads problem? (was: opendir gets Stale NFS file handle)

Wed Oct 1 03:08:42 UTC 2014

Pranith had RCAed one of the race conditions that leaves a stale dentry in the server inode table. The race can be outlined as follows (T1 and T2 are two threads):

1. T1: readdirp in storage/posix reads a dentry (say <pgfid1, bname1>) along with its metadata and gfid.
2. T2: unlink (pgfid1, bname1) is done in storage/posix, and the dentry <pgfid1, bname1> is purged from the server inode table (inode table management is done by protocol/server).
3. T1: links <pgfid1, bname1> in the server inode table with the gfid read in step 1.
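To make the interleaving concrete, here is a toy model of the three steps, forced into the problematic order. Everything in it is hypothetical: `backend`, `inode_table`, `readdirp_read`, `unlink` and `link_dentry` are illustrative stand-ins, not GlusterFS functions, and the real storage/posix and protocol/server code is far more involved.

```python
# Toy model of the readdirp/unlink race (hypothetical names, not GlusterFS code).
backend = {("pgfid1", "bname1"): "gfid1"}   # entries on the exported brick
inode_table = {}                            # server-side dentry cache


def readdirp_read(pgfid, bname):
    """T1, step 1: storage/posix reads the dentry and its gfid."""
    return backend[(pgfid, bname)]


def unlink(pgfid, bname):
    """T2, step 2: delete on the brick and purge the cached dentry."""
    del backend[(pgfid, bname)]
    inode_table.pop((pgfid, bname), None)


def link_dentry(pgfid, bname, gfid):
    """T1, step 3: link the dentry using the gfid read in step 1."""
    inode_table[(pgfid, bname)] = gfid


# Force the problematic interleaving deterministically.
gfid = readdirp_read("pgfid1", "bname1")   # 1. T1
unlink("pgfid1", "bname1")                 # 2. T2
link_dentry("pgfid1", "bname1", gfid)      # 3. T1, with data read before the unlink

# Dentries cached on the server but no longer present on the brick.
stale = set(inode_table) - set(backend)
print(sorted(stale))  # [('pgfid1', 'bname1')]
```

The point of the sketch is only the ordering: because step 3 uses data read before step 2, nothing removes the dentry that step 3 re-creates.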

Now, since the unlink on <pgfid1, bname1> happened before T1 linked it, the dentry remains in the server inode table (only in the server inode table, since the entry has already been deleted on the exported brick), resulting in ESTALE errors.

The same situation can be hit when T1 does a lookup on that dentry instead of a readdirp. However, I am not sure this is a serious problem, since the entry has been deleted from the backend (we do not return ESTALE for a file/directory that is actually present on the backend).
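The argument that the stale dentry is mostly harmless can be illustrated with a toy resolver. This is a sketch under assumptions: `resolve`, the name-keyed `backend` dict and the return convention (a gfid string, or a negative errno) are hypothetical and only mimic the idea that a cached dentry is checked against the brick. It shows ESTALE appearing only for a name that no longer exists on the brick, while a live entry resolves normally.

```python
import errno


def resolve(name, inode_table, backend):
    """Hypothetical resolver: a cached dentry is validated against the brick."""
    if name in inode_table:
        if backend.get(name) == inode_table[name]:
            return inode_table[name]      # healthy cached entry
        return -errno.ESTALE              # cached, but gone from the brick
    # Not cached: fall back to the brick, or report ENOENT.
    return backend.get(name, -errno.ENOENT)


# State left behind by the race: bname1 is cached but deleted on the brick.
backend = {("pgfid1", "live"): "gfid2"}
inode_table = {("pgfid1", "bname1"): "gfid1", ("pgfid1", "live"): "gfid2"}

stale_result = resolve(("pgfid1", "bname1"), inode_table, backend)
live_result = resolve(("pgfid1", "live"), inode_table, backend)
```

In this model `stale_result` is `-errno.ESTALE` and `live_result` is `"gfid2"`, so a file that is actually present on the backend is never reported as stale.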

In this case, just restarting the volume would make the problem go away, since after a restart the servers start with a fresh inode cache. I am not sure whether this is the same problem you are facing, but it seems related.

regards,
Raghavendra.

----- Original Message -----
> From: "Emmanuel Dreyfus" <manu at netbsd.org>
> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, "Niels de Vos" <ndevos at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Tuesday, September 30, 2014 5:19:31 PM
> Subject: Re: [Gluster-devel] io-threads problem? (was: opendir gets Stale NFS file handle)
> 
> Raghavendra Gowdappa <rgowdapp at redhat.com> wrote:
> 
> > Is there a possibility that the directory was deleted from some other
> > client? In that case, this is not really an error. Otherwise, there
> > might be some issue.
> 
> I deleted the volume and started over: the problem vanished. I wonder
> how to cope with that on a production machine where data should not be
> deleted like that.
> 
> --
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> manu at netbsd.org
>