[Gluster-devel] RFC/Review: libgfapi object handle based extensions

Tue Oct 1 02:08:05 UTC 2013

----- Original Message ----- 

> From: "Anand Avati" <avati at gluster.org>
> To: "Shyamsundar Ranganathan" <srangana at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at nongnu.org>
> Sent: Monday, September 30, 2013 10:04:04 PM
> Subject: Re: RFC/Review: libgfapi object handle based extensions

> On Mon, Sep 30, 2013 at 3:40 AM, Shyamsundar Ranganathan <
> srangana at redhat.com > wrote:

> > Avati, Amar,
> 

> > Amar, Anand S and I had a discussion on this comment, and here is an
> > answer to your queries the way I see it. Let me know if I am missing
> > something here.
> 

> > (This is not an NFS Ganesha requirement, FYI, as Ganesha will only do a
> > single lookup or preserve a single object handle per filesystem object in
> > its cache.)
> 

> > Currently a glfs_object is an opaque pointer to an object (it is a _handle_
> > to the object). The object itself contains a ref'd inode, which is the
> > actual pointer to the object.
> 

> > 1) The similarity and differences of object handles to fds
> 

> > The intention of multiple object handles is in line with multiple fds per
> > file: an application using the library is free to look up (and/or create,
> > and its equivalents) and acquire as many object handles as it wants for a
> > particular object, and can hence determine the lifetime of each such object
> > in its view. So in essence one thread can have an object handle to perform,
> > say, attribute-related operations, whereas another thread has the same
> > object looked up to perform IO.
> 

> So do you mean a glfs_object is meant to be a *per-operation* handle? If one
> thread wants to perform a chmod() and another thread wants to perform
> chown() and both attempt to resolve the same name and end up getting
> different handles, then both of them unref the glfs_handle right after their
> operation?

Both of them could unref, or could hold it in a cache etc. as they choose, but in short yes.

> > Where the object handles depart from the notion of fds is when an unlink is
> > performed. As POSIX defines that open fds are still _open_ for activities
> > on the file, the life of an fd and of the actual object that it points to
> > lasts until the fd is closed. In the case of object handles though, the
> > moment any handle is used to unlink the object (which BTW is done using the
> > parent object handle and the name of the child), all handles pointing to
> > the object are still valid pointers, but operations on them will result in
> > ENOENT, as the actual object has since been unlinked and removed by the
> > underlying filesystem.
> 

> Not always. If the file had hardlinks the handle should still be valid. And
> if there were no hardlinks and you unlinked the last link, further

Agreed on the hardlinks: failures would not happen there, i.e. the handle would still be valid (as internally the inode is valid).

> operations must return ESTALE. ENOENT is when a "basename" does not resolve
> to a handle (in entry operations) - for e.g when you try to unlink the same
> entry a second time. Whereas ESTALE is when a presented handle does not
> exist - for e.g when you try to operate (read, chmod) a handle which got
> deleted.
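
As an illustration of the ENOENT/ESTALE distinction drawn above, here is a minimal C sketch. It is a toy model and not libgfapi code: every name in it (demo_inode, demo_dentry, demo_dir, demo_unlink, demo_chmod) is hypothetical, and the nlink counter is only an assumed stand-in for whatever the filesystem really tracks. It shows an entry operation failing with ENOENT when a basename does not resolve, and a handle operation failing with ESTALE once the last link is gone.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy types; nothing here is part of libgfapi. */
struct demo_inode { int nlink; };   /* number of names linked to the object */
struct demo_dentry {
    const char *name;
    struct demo_inode *inode;       /* NULL marks a free directory slot */
};

#define DEMO_DIR_SLOTS 4
struct demo_dir { struct demo_dentry slots[DEMO_DIR_SLOTS]; };

/* Entry operation: a basename is resolved first, so a name that does
 * not resolve yields -ENOENT (e.g. unlinking the same entry twice). */
int demo_unlink(struct demo_dir *dir, const char *name)
{
    assert(dir != NULL && name != NULL);
    for (size_t i = 0; i < DEMO_DIR_SLOTS; i++) {
        struct demo_dentry *d = &dir->slots[i];
        if (d->inode != NULL && strcmp(d->name, name) == 0) {
            d->inode->nlink--;      /* one name fewer for this object */
            d->inode = NULL;        /* free the slot */
            return 0;
        }
    }
    return -ENOENT;
}

/* Handle operation: no name is involved. The handle stays usable while
 * any hardlink remains and yields -ESTALE after the last one is gone. */
int demo_chmod(struct demo_inode *handle)
{
    assert(handle != NULL);
    return handle->nlink > 0 ? 0 : -ESTALE;
}
```

With one inode linked under two names, unlinking the first name leaves demo_chmod succeeding, unlinking that same name again returns -ENOENT, and only after the second name is unlinked does demo_chmod return -ESTALE.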

> > The departure from fds is valid from my perspective, as the handle points
> > to an object which has since been removed, so there are no semantics here
> > that require it to be preserved for further operations merely because a
> > reference to it is held.
> 

> The departure is only in the behavior of unlinked files. That is orthogonal
> to whether you want to return separate handles each time a component is
> looked up. I fail to see how the "departure from fd behavior" justifies
> creating new glfs_object per lookup?

The departure does not justify the need for a separate handle each time, hence point (2) below. What I mean to say is that point (1) is not a justification for a separate handle each time; it just describes the behaviour.

> > So in essence, each time an object handle is returned by the API, it has
> > to be closed for its life to end. Additionally, if the object that it
> > points to is removed from the underlying system, the handle is pointing to
> > an entry that no longer exists and returns ENOENT on operations using it.
> 

> > 2) The issue/benefit of having the same object handle irrespective of
> > looking it up multiple times
> 

> > If we have a 1:1 relationship of object handles (i.e. struct glfs_object)
> > to inodes, then the caller gets the same pointer to the handle. Hence
> > having multiple handles, as the caller sees it, boils down to giving out
> > ref-counted glfs_object(s) for the same inode.
> 

> > Other than the memory footprint, this will still not make the object live
> > past its unlink time. The pointer handed out will still be valid till the
> > last ref count is removed (i.e. the object handle is closed), at which
> > point the object handle can be destroyed.
> 

> If I understand what you say above correctly, you intend to solve the problem
> of "unlinked files must return error" at your API layer? That's wrong. The
> right way is to ref-count glfs_object and return them precisely because you
> should NOT make the decision about the end of life of an inode at that
> layer. A hardlink may have been created by another client and the
> glfs_object may therefore still be valid.

The error for unlinked files does not come from the API layer. Rather, an operation on the object the handle points to (mostly a syncop) fails with ENOENT. So the decision is not made at the API layer at all for unlinked files; it is just the behaviour.
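
To make the 1:1 option concrete, the following is a rough C sketch of a ref-counted handle table keyed by GFID. It is not how libgfapi implements glfs_object: objtab, obj_handle, objtab_lookup and objtab_close are hypothetical names, the fixed-size table is an assumption made for brevity, and no locking is shown. Repeated lookups of the same GFID return the same pointer with one more reference, and the handle is destroyed only when the last reference is closed. Nothing in it decides whether the underlying object still exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define OBJ_GFID_LEN 16
#define OBJTAB_SLOTS 8

/* Hypothetical stand-in for a glfs_object shared by all callers. */
struct obj_handle {
    unsigned char gfid[OBJ_GFID_LEN];
    int refs;
};

struct objtab { struct obj_handle *slots[OBJTAB_SLOTS]; };

/* Return the handle for gfid, taking a reference. A handle is created
 * on first lookup. Returns NULL if the table is full or malloc fails. */
struct obj_handle *objtab_lookup(struct objtab *t, const unsigned char *gfid)
{
    struct obj_handle **free_slot = NULL;

    assert(t != NULL && gfid != NULL);
    for (size_t i = 0; i < OBJTAB_SLOTS; i++) {
        struct obj_handle *o = t->slots[i];
        if (o == NULL) {
            if (free_slot == NULL)
                free_slot = &t->slots[i];   /* remember first free slot */
        } else if (memcmp(o->gfid, gfid, OBJ_GFID_LEN) == 0) {
            o->refs++;                      /* same object, same pointer */
            return o;
        }
    }
    if (free_slot == NULL)
        return NULL;

    struct obj_handle *o = malloc(sizeof(*o));
    if (o == NULL)
        return NULL;
    memcpy(o->gfid, gfid, OBJ_GFID_LEN);
    o->refs = 1;
    *free_slot = o;
    return o;
}

/* Drop one reference and return the number remaining. The handle is
 * removed from the table and freed when that number reaches zero. */
int objtab_close(struct objtab *t, struct obj_handle *o)
{
    assert(t != NULL && o != NULL && o->refs > 0);
    if (--o->refs > 0)
        return o->refs;
    for (size_t i = 0; i < OBJTAB_SLOTS; i++)
        if (t->slots[i] == o)
            t->slots[i] = NULL;
    free(o);
    return 0;
}
```

A real implementation would need a lock around the table and would hold an inode reference inside the handle; both are left out here to keep the ref-counting visible.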

> You are also returning separate glfs_object for different hardlinks of a
> file. Does that mean glfs_object is representing a dentry? or a
> per-operation reference to an inode?

The glfs_object is a handle to an object in the filesystem; it does not point to something, it is just a handle. But I think we have more of this covered in the next couple of mails :)

> > So again, as many handles were handed out for the same inode, they have to
> > be closed, etc.
> 

> > 3) Graph switches
> 

> > In the case of graph switches, handles that are used in operations post the
> > switch, get refreshed with an inode from the new graph, if we have an N:1
> > object to inode relationship.
> 

> > In the case of 1:1 this is done once, but is there some multi thread safety
> > that needs to be in place? I think this is already in place from the
> > glfs_resolve_inode implementation as suggested earlier, but good to check.
> 
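
The thread-safety question for the 1:1 case can be pictured with a small C sketch. It does not reproduce glfs_resolve_inode; gs_inode, gs_handle and gs_resolve are hypothetical names, and the caller passing in the new graph's inode is an assumed stand-in for a real re-lookup by GFID. The point is only that the check and the swap of the cached inode happen under one lock, so concurrent users of a shared handle refresh it once and all see the same result.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-graph inode: one object has a distinct inode per graph. */
struct gs_inode { int graph_id; };

/* Hypothetical shared handle caching the inode it last resolved. */
struct gs_handle {
    pthread_mutex_t lock;
    struct gs_inode *inode;
};

/* Return the inode to use in active_graph. If the cached inode belongs
 * to an older graph, replace it with fresh, which the caller obtained
 * for the active graph. fresh may be NULL when no switch has occurred. */
struct gs_inode *gs_resolve(struct gs_handle *h, int active_graph,
                            struct gs_inode *fresh)
{
    struct gs_inode *in;

    assert(h != NULL);
    pthread_mutex_lock(&h->lock);
    if (h->inode->graph_id != active_graph) {
        assert(fresh != NULL && fresh->graph_id == active_graph);
        h->inode = fresh;   /* refreshed once; later callers reuse it */
    }
    in = h->inode;
    pthread_mutex_unlock(&h->lock);
    return in;
}
```

With N:1 handles each handle would carry its own cached inode and would be refreshed independently on first use after the switch.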

> > 4) Renames
> 

> > In the case of renames, the inode remains the same, hence all handed out
> > object handles still are valid and will operate on the right object per se.
> 

> > 5) unlinks and recreation of the same _named_ object in the background
> 

> > An example: an application gets a handle for an object, say named "a.txt",
> > and in the background (or via another application/client) this is deleted
> > and recreated.
> 

> > This will return ENOENT, as the GFID would have changed from the previously
> > held object to the new one, even though the names are the same. This seems
> > like the right behaviour, and does not change between a 1:1 and an N:1
> > object handle to inode mapping.
> 

> > So bottom line, I see the object handles like an fd with the difference
> > noted above. Having them in a 1:1 relationship or as an N:1 relationship
> > does not seem to be an issue from what I understand. What am I missing
> > here?
> 

> The issue is this. From what I understand, the usage of glfs_object in the
> FSAL is not like a per-operation handle, but something stored long term
> (many minutes, hours, days) in the per-inode context of the NFS Ganesha
> layer. Now NFS Ganesha may be doing the "right thing" by not re-looking up
> an already looked up name and therefore avoiding a leak (I'm not so sure, it
> still needs to verify every so often if the mapping is still valid). From
> NFS Ganesha's point of view the handle is changing on every lookup.

FYI, Ganesha does do a getattrs/stat if it finds that the cache entry has expired based on internal cache timeouts. In these cases the handle is not new; the same old handle is used to refresh via the said APIs.

> Now consider what happens in case of READDIRPLUS. A list of names and handles
> are returned to the client. The list of names can possibly include names
> which were previously looked up as well. Both are supposed to represent the
> same "gfid", but here will be returning new glfs_objects. When a client
> performs an operation on a GFID, on which glfs_object will the operation be
> performed at the gfapi layer? This part seems very ambiguous and not clear.

I should have made a note about readdirplus earlier: it would default to the fd-based version, not a handle/object-based version. So we would transition from a handle to an fd via glfs_h_opendir and then continue with the readdir variants. If I look at the POSIX *at routines, this seems about right, but of course we may have variances here.
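
For comparison, this is roughly what the equivalent transition looks like in plain POSIX, which is the analogy being drawn. The sketch uses only standard calls (open, fdopendir, readdir, closedir) and makes no libgfapi calls; count_dir_entries is a hypothetical helper name. The directory is first referenced by a descriptor, then promoted to a directory stream, and iteration proceeds on the stream.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Count the entries of the directory at path, skipping "." and "..".
 * Returns -1 with errno set on failure. */
long count_dir_entries(const char *path)
{
    assert(path != NULL);

    /* Step 1: obtain a reference to the directory object. */
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    /* Step 2: move from the descriptor to a directory stream. */
    DIR *dir = fdopendir(fd);
    if (dir == NULL) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* Step 3: iterate with the readdir family. */
    long n = 0;
    struct dirent *de;
    errno = 0;                      /* lets us tell end-of-dir from error */
    while ((de = readdir(dir)) != NULL) {
        if (strcmp(de->d_name, ".") != 0 && strcmp(de->d_name, "..") != 0)
            n++;
    }
    int saved = errno;
    closedir(dir);                  /* also closes fd */
    if (saved != 0) {
        errno = saved;
        return -1;
    }
    return n;
}
```

In the handle-based scheme described above, glfs_h_opendir would play the role of step 2, with the object handle taking the place of the descriptor.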

> What would really help is if you can tell what a glfs_object is supposed to
> represent? - an on disk inode (i.e GFID)? an in memory per-graph inode (i.e
> inode_t)? A dentry? A per-operation handle to an on disk inode? A
> per-operation handle to an in memory per-graph inode? A per operation handle
> to a dentry? In the current form, it does not seem to fit any of these
> categories.

Well, I think of it as a handle to a file system object. Having said that, if we just returned the inode pointer as this handle, graph switches can cause a problem, in which case we need to default to (as per my understanding) the FUSE manner of working. Keeping the handle 1:1 via other infrastructure does not seem beneficial ATM. I think you cover this in the subsequent mail, so let us continue there.

Shyam