[Gluster-devel] handling open fds and graph switches
Raghavendra Bhat
rabhat at redhat.com
Wed Aug 7 05:53:42 UTC 2013
On 08/07/2013 05:11 AM, Raghavendra G wrote:
>
> On Wed, Aug 7, 2013 at 12:21 AM, Raghavendra Bhat <rabhat at redhat.com
> <mailto:rabhat at redhat.com>> wrote:
>
> On 08/06/2013 05:22 PM, Raghavendra Gowdappa wrote:
>
>
> ----- Original Message -----
>
> From: "Raghavendra Bhat" <rabhat at redhat.com
> <mailto:rabhat at redhat.com>>
> To: gluster-devel at nongnu.org <mailto:gluster-devel at nongnu.org>
> Sent: Tuesday, August 6, 2013 1:52:40 PM
> Subject: [Gluster-devel] handling open fds and graph switches
>
>
> Hi,
>
> As of now, there is a problem when the following sequence of
> operations is performed on a file:
>
> open () => unlink () => graph change (not a reconfigure) => fop
> on the opened fd (say, a write)
>
> In the above sequence, the fop performed on the fd after the
> graph switch fails with EBADFD (which should not happen). It's
> because when the file is unlinked (assuming there are no other
> hardlinks for the file), the gfid handle present in the
> .glusterfs directory of the brick is removed. Now when a graph
> change happens, all fds have to be migrated to the new graph.
> Before that, a nameless lookup is sent on the gfid (to build the
> new inode in the new graph). The nameless lookup operates on the
> gfid handle. But since the gfid handle was removed upon receiving
> the unlink, the nameless lookup fails, thus failing the fd
> migration to the new graph, and the fops on the fd fail as well.
>
> A patch has been sent to handle this
> (http://review.gluster.org/#/c/5428/), where the gfid handle is
> removed only when the last reference to the file goes away (i.e.
> upon getting the unlink, it also checks whether there are any
> open fds on the inode. If so, the gfid handle is not removed; it
> is removed when release on that fd is received). But that
> approach might lead to gfid handle leaks (what if glusterfsd
> crashes after unlinking the last entry? The gfid handle would not
> have been removed because of the open fds, and if glusterfsd
> crashes then, the gfid handle for that file is leaked).
>
> Another approach might be to make posix_lookup do a stat on one
> of the fds present on the inode when it has to build an INODE
> HANDLE (which happens as part of the nameless lookup). The
> nameless lookup then succeeds and the new inode is looked up in
> the new graph for the client. But after that, there are 2 more
> issues.
>
> 1) After successful completion of the nameless lookup, the file
> has to be opened in the new graph. So a syncop_open is sent on
> the new graph for the gfid. In posix_open, the posix xlator again
> tries to open the file using the gfid handle. But since the gfid
> handle is removed, the open fails and the file is not opened
> (thus fd migration fails again). We could search the list of fds
> for the inode, find the right fd that the fuse client is trying
> to migrate, and return that fd. But finding the right fd is hard.
> (What if a fuse client has opened 2 fds with the same flags?)
>
> If there is more than one posix fd (fd opened on the backend
> filesystem) with the same flags, it's not really an issue. For
> our purposes it doesn't make any difference. Within glusterfs
> we'll be using a different fd object anyway (to maintain lock
> state etc.). At the posix level all we need is an fd opened
> with the correct flags. We can dup one of these (posix) fds and
> associate the duped fd with the glusterfs fd object. Please note
> that returning the glfs_fd_object (with a reference) won't work
> here, since the glusterfs fd object we are migrating might
> have a different lock state than the one having the posix fd
> opened with the same flags. We need to dup the posix fd and
> associate that fd with a new glusterfs fd object.
>
> Ok. I will look into this method.
>
> 2) Another problem is open-behind. The fuse xlator, after the
> nameless lookup, sends syncop_open to migrate the fds. Once the
> syncop_open is complete and the fds are migrated, the PARENT_DOWN
> event is sent on the old graph and the client xlator sends
> release on all the fds (if the previous syncop_open was
> successful, it is safe to send release from the old graph, as the
> fd would have been migrated to the new graph, with the
> corresponding fd present on the brick). But before that, in
> syncop_open, open-behind might have returned success to fuse
> without actually winding the open call to the xlators below. Now
> fuse gets success for the open and sends PARENT_DOWN to the old
> graph, which sends release on the fd. Thus, even though an fd is
> present from the application's point of view, there is no way
> left to access the file (as the fds and gfid handles have already
> been removed).
>
> Introduce a key in xdata, say "force-open", in the open fop, and
> if that key is set, make open-behind not delay the open.
>
> But the problem is that syncop_open () does not send any
> dictionary (it will be NULL). We could make open-behind check
> whether xdata is NULL and, if so, treat the open call as
> generated internally (not from the application) and wind it to
> the xlator below.
>
>
> Hmm.. I am not too sure whether we can rely on the interpretation
> that xdata being NULL means force-open in open-behind. There
> definitely are/will be other use-cases of syncop-open where
> someone might inadvertently leave xdata NULL. Being explicit
> about what we want to do always helps understandability. Can't
> you create an xdata in the fuse fd migration code and pass that
> down to syncop-open?
Whoever calls syncop_open does not pass xdata as an argument at
all. The call looks like this:

    ret = syncop_open (new_subvol, &loc, flags, newfd);

The syncop framework itself sends xdata as NULL while winding the
call (making the syncop framework allocate a new dict before
winding and pass it as an argument won't work in this case either,
as fuse won't be able to set any new key in it).
>
>
>
>
> Please provide feedback on the above issues.
>
>
> Regards,
> Raghavendra Bhat
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org <mailto:Gluster-devel at nongnu.org>
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>
>
>
>
>
>
>
> --
> Raghavendra G