[Gluster-devel] handling open fds and graph switches

Raghavendra Bhat rabhat at redhat.com
Wed Aug 7 05:53:42 UTC 2013


On 08/07/2013 05:11 AM, Raghavendra G wrote:
>
> On Wed, Aug 7, 2013 at 12:21 AM, Raghavendra Bhat <rabhat at redhat.com 
> <mailto:rabhat at redhat.com>> wrote:
>
>     On 08/06/2013 05:22 PM, Raghavendra Gowdappa wrote:
>
>
>         ----- Original Message -----
>
>             From: "Raghavendra Bhat" <rabhat at redhat.com
>             <mailto:rabhat at redhat.com>>
>             To: gluster-devel at nongnu.org <mailto:gluster-devel at nongnu.org>
>             Sent: Tuesday, August 6, 2013 1:52:40 PM
>             Subject: [Gluster-devel] handling open fds and graph switches
>
>
>             Hi,
>
>             As of now, there is a problem when following set of
>             operations are
>             performed on a file.
>
>             open () => unlink () => do a graph change (not
>             reconfigure) => fop on
>             the opened fd (may be write)
>
>             In the above set of operations, the fop performed on the
>             fd after the
>             graph switch fails with EBADFD (which should not happen).
>             Its because
>             when the file is unlinked (assume there are no other
>             hardlinks for the
>             file), the gfid handle present in the .glusterfs directory
>             of the brick
>             is removed. Now when graph change happens, all fds have to
>             be migrated
>             to the new graph. Before that a nameless lookup will be
>             sent on the gfid
>             (to build the new inode in the new graph). The nameless
>             lookup happens
>             on the gfid handle. But since the gfid handle is removed
>             upon receiving
>             the unlink, nameless lookup fails, thus failing the fd
>             migration to the
>             new graph and the fops on the fd are also failed.
>
>             A patch has been sent to handle
>             this(http://review.gluster.org/#/c/5428/), where the gfid
>             handle is
>             removed when the last reference to the file is removed
>             (i.e upon getting
>             the unlink, it also checks whether there are any open fds
>             on the inode.
>             If so, then the gfid handle is not removed. Its removed
>             when release on
>             that fd is received). But that approach might lead to gfid
>             handle leaks
>             (what if glusterfsd crashes upon unlinking the last entry?
>             the gfid
>             handle might not have been removed if there are open fds.
>             And now if
>             glusterfsd crashes, then the gfid handle for that file is
>             leaked).
>
>             Another approach might be to make posix_lookup do a stat
>             on one of the
>             fds present on the inode when it has to build a INODE
>             HANDLE (which
>             happens as part of nameless lookup). The nameless lookup
>             suceeds and the
>             new inode is looked up in the new graph for the client.
>             But after that,
>             there are 2 more issues.
>
>             1) After successful completion of the nameless lookup, the
>             file has to
>             be opened in the new graph. So a syncop_open is sent on
>             the new graph
>             for the gfid. In posix_open, posix xlator again tries to
>             open the file
>             using the gfid handle. But since the gfid handle is
>             removed, open fails
>             and the file is not opened (thus fd migration fails
>             again.) We can
>             search the list of fds for the inode, find the right fd
>             that the fuse
>             client is trying to migrate and return that fd. But
>             searching the right
>             fd is a hard task. (What if a fuse client has opened 2 fds
>             with same flags?)
>
>         If there is more than one posix fd (fd opened on backend
>         filesystem) with same flags, its not really an issue. For our
>         purposes it doesn't make any difference. Within glusterfs
>         we'll anyways be using a different fd object (to maintain lock
>         state etc). At the posix level all we need is an fd opened
>         with correct flags. We can dup one of these (posix) fds and
>         associate the duped fd with glusterfs fd object. Please note
>         that returning glfs_fd_object (with a reference) won't work
>         here, since the glusterfs fd object we are migrating might
>         have different lock state than the one having posix fd opened
>         with same flags. We need to dup the posix fd and associate
>         that fd with a new glusterfs fd object.
>
>     Ok. Will see this method.
>
>             2) Another problem is open-behind. Fuse xlator after
>             nameless lookup,
>             sends syncop_open to migrate the fds. Once the syncop_open
>             is complete
>             and fds are migrated, PARENT_DOWN event is sent on the old
>             graph and the
>             client xlator sends release on all the fds (if the
>             previous syncop_open
>             is successful, then its safe to send release from old
>             graph as the new
>             fd would have been migrated to the new graph, with
>             corresponding fd
>             present in the brick). But before that in syncop_open,
>             open-behind might
>             have sent success to the fuse without actually winding the
>             open call to
>             the below xlators. Now fuse gets success for the open,
>             sends PARENT_DOWN
>             to old graph, which sends release on the fd. Thus even
>             though a fd is
>             present from application's point of view, there are no
>             mechanisms to
>             access the file (as the fds and gfid handles have been
>             removed already.)
>
>         Introduce a key in xdata "force-open" in open fop and if that
>         key is set, make open-behind to not to delay open.
>
>     But the problem is syncop_open () does not send any dictionary (it
>     will be NULL). We can make open-behind
>     check whether xdata is NULL and if so, consider that open call be
>     generated internally (not from application) and wind it to the
>     below xlator.
>
>
> Hmm.. I am not too sure whether we can rely on the interpretation that 
> xdata being  NULL means to force open in open-behind. There definitely 
> are/will be other use-cases of syncop-open where some might 
> inadvertently leave xdata NULL. It always helps in terms of 
> understandability, to be explicit on what we want to do. Can't you 
> create an xdata in fuse fd migration code and pass that down to 
> syncop-open?
Whoever calls syncop_open does not send the xdata as the arugement at 
all. It will be like this.
ret = syncop_open (new_subvol, &loc, flags, newfd);

The syncop framework itself sends the xdata as NULL while winding the 
call (making syncop framework allocate a new dict before winding and 
send it as an argument also wont work in this case, as fuse wont be able 
to set any new key).
>
>
>
>
>             Please provide feedback on the above issues.
>
>
>             Regards,
>             Raghavendra Bhat
>
>
>             _______________________________________________
>             Gluster-devel mailing list
>             Gluster-devel at nongnu.org <mailto:Gluster-devel at nongnu.org>
>             https://lists.nongnu.org/mailman/listinfo/gluster-devel
>
>
>
>     _______________________________________________
>     Gluster-devel mailing list
>     Gluster-devel at nongnu.org <mailto:Gluster-devel at nongnu.org>
>     https://lists.nongnu.org/mailman/listinfo/gluster-devel
>
>
>
>
> -- 
> Raghavendra G

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://supercolony.gluster.org/pipermail/gluster-devel/attachments/20130807/6bb39bd4/attachment-0001.html>


More information about the Gluster-devel mailing list