[Gluster-devel] handling open fds and graph switches
Raghavendra Gowdappa
rgowdapp at redhat.com
Thu Aug 8 02:24:41 UTC 2013
----- Original Message -----
> From: "Raghavendra Bhat" <rabhat at redhat.com>
> To: "Raghavendra G" <raghavendra at gluster.com>
> Cc: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, "gluster-devel" <gluster-devel at nongnu.org>
> Sent: Wednesday, August 7, 2013 11:23:42 AM
> Subject: Re: [Gluster-devel] handling open fds and graph switches
>
> On 08/07/2013 05:11 AM, Raghavendra G wrote:
> >
> > On Wed, Aug 7, 2013 at 12:21 AM, Raghavendra Bhat <rabhat at redhat.com> wrote:
> >
> > On 08/06/2013 05:22 PM, Raghavendra Gowdappa wrote:
> >
> >
> > ----- Original Message -----
> >
> > From: "Raghavendra Bhat" <rabhat at redhat.com>
> > To: gluster-devel at nongnu.org
> > Sent: Tuesday, August 6, 2013 1:52:40 PM
> > Subject: [Gluster-devel] handling open fds and graph switches
> >
> >
> > Hi,
> >
> > As of now, there is a problem when the following sequence of
> > operations is performed on a file:
> >
> > open () => unlink () => graph change (not reconfigure) => fop on the
> > opened fd (say, a write)
> >
> > In the above sequence, the fop performed on the fd after the graph
> > switch fails with EBADFD (which should not happen). It's because when
> > the file is unlinked (assuming there are no other hardlinks to the
> > file), the gfid handle present in the .glusterfs directory of the
> > brick is removed. Now when a graph change happens, all fds have to be
> > migrated to the new graph. Before that, a nameless lookup is sent on
> > the gfid (to build the new inode in the new graph). The nameless
> > lookup operates on the gfid handle. But since the gfid handle was
> > removed upon receiving the unlink, the nameless lookup fails, thus
> > failing the fd migration to the new graph, and the fops on the fd
> > fail as well.
> >
> > A patch has been sent to handle this
> > (http://review.gluster.org/#/c/5428/), where the gfid handle is
> > removed only when the last reference to the file goes away: upon
> > getting the unlink, it also checks whether there are any open fds on
> > the inode. If so, the gfid handle is not removed; it is removed when
> > release on that fd is received. But that approach might lead to gfid
> > handle leaks. What if glusterfsd crashes after unlinking the last
> > entry? The gfid handle would not have been removed because of the
> > open fds, and if glusterfsd now crashes, the gfid handle for that
> > file is leaked.
> >
> > Another approach might be to make posix_lookup do a stat on one of
> > the fds present on the inode when it has to build an INODE HANDLE
> > (which happens as part of nameless lookup). The nameless lookup then
> > succeeds and the new inode is looked up in the new graph for the
> > client. But after that, there are 2 more issues.
> >
> > 1) After successful completion of the nameless lookup, the file has
> > to be opened in the new graph. So a syncop_open is sent on the new
> > graph for the gfid. In posix_open, the posix xlator again tries to
> > open the file using the gfid handle. But since the gfid handle is
> > removed, the open fails and the file is not opened (thus fd migration
> > fails again). We can search the list of fds for the inode, find the
> > right fd that the fuse client is trying to migrate, and return that
> > fd. But finding the right fd is a hard task. (What if a fuse client
> > has opened 2 fds with the same flags?)
> >
> > If there is more than one posix fd (fd opened on the backend
> > filesystem) with the same flags, it's not really an issue; for our
> > purposes it doesn't make any difference. Within glusterfs we'll be
> > using a different fd object anyway (to maintain lock state etc.). At
> > the posix level all we need is an fd opened with the correct flags.
> > We can dup one of these (posix) fds and associate the duped fd with
> > the glusterfs fd object. Please note that returning the glfs_fd_object
> > (with a reference) won't work here, since the glusterfs fd object we
> > are migrating might have different lock state than the one having the
> > posix fd opened with the same flags. We need to dup the posix fd and
> > associate that fd with a new glusterfs fd object.
> >
> > Ok. Will see this method.
> >
> > 2) Another problem is open-behind. The fuse xlator, after the
> > nameless lookup, sends syncop_open to migrate the fds. Once the
> > syncop_open is complete and the fds are migrated, a PARENT_DOWN event
> > is sent on the old graph and the client xlator sends release on all
> > the fds (if the previous syncop_open was successful, it is safe to
> > send release from the old graph, as the fd would have been migrated
> > to the new graph, with the corresponding fd present on the brick).
> > But before that, in syncop_open, open-behind might have returned
> > success to fuse without actually winding the open call to the
> > xlators below. Now fuse gets success for the open and sends
> > PARENT_DOWN to the old graph, which sends release on the fd. Thus
> > even though an fd is present from the application's point of view,
> > there is no way to access the file (as the fds and gfid handles have
> > already been removed).
> >
> > Introduce a key in xdata, "force-open", in the open fop, and if that
> > key is set, make open-behind not delay the open.
> >
> > But the problem is that syncop_open () does not send any dictionary
> > (it will be NULL). We can make open-behind check whether xdata is
> > NULL and, if so, consider the open call to be generated internally
> > (not from the application) and wind it to the xlator below.
> >
> >
> > Hmm.. I am not too sure whether we can rely on the interpretation
> > that xdata being NULL means force-open in open-behind. There
> > definitely are/will be other use-cases of syncop_open where someone
> > might inadvertently leave xdata NULL. It always helps, in terms of
> > understandability, to be explicit about what we want to do. Can't you
> > create an xdata in the fuse fd migration code and pass that down to
> > syncop_open?
> Whoever calls syncop_open does not send the xdata as an argument at
> all. It will be like this:
>
> ret = syncop_open (new_subvol, &loc, flags, newfd);
>
> The syncop framework itself sends the xdata as NULL while winding the
> call (making the syncop framework allocate a new dict before winding
> and send it as an argument also won't work in this case, as fuse won't
> be able to set any new key).
Since syncops are synchronous counterparts of asynchronous fops, I think we can add xdata as an argument. How about adding an xdata argument to syncops, just the way each fop has one?
Others,
Do you have any comments or reservations on this?
> >
> > Please provide feedback on the above issues.
> >
> >
> > Regards,
> > Raghavendra Bhat
> >
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at nongnu.org
> > https://lists.nongnu.org/mailman/listinfo/gluster-devel
> >
> >
> >
> >
> > --
> > Raghavendra G
>
>