[Gluster-devel] RFC/Review: libgfapi object handle based extensions

Mon Sep 30 16:34:04 UTC 2013

On Mon, Sep 30, 2013 at 3:40 AM, Shyamsundar Ranganathan <
srangana at redhat.com> wrote:

> Avati, Amar,
>
> Amar, Anand S and myself had a discussion on this comment and here is an
> answer to your queries the way I see it. Let me know if I am missing
> something here.
>
> (this is not a NFS Ganesha requirement, FYI. As Ganesha will only do a
> single lookup or preserve a single object handle per filesystem object in
> its cache)
>
> Currently a glfs_object is an opaque pointer to an object (it is a
> _handle_ to the object). The object itself contains a ref'd inode, which is
> the actual pointer to the object.
>
> 1) The similarity and differences of object handles to fds
>
> The intention of multiple object handles is in lines with multiple fd's
> per file, an application using the library is free to lookup (and/or create
> (and its equivalents)) and acquire as many object handles as it wants for a
> particular object, and can hence determine the lifetime of each such object
> in its view. So in essence one thread can have an object handle to perform,
> say attribute related operations, whereas another thread has the same
> object looked up to perform IO.
>

So do you mean a glfs_object is meant to be a *per-operation* handle? If
one thread wants to perform a chmod() and another thread wants to perform
chown() and both attempt to resolve the same name and end up getting
different handles, then both of them unref the glfs_handle right after
their operation?

> Where the object handles depart from the notion of fds is when an unlink
> is performed. As POSIX defines that open fds are still _open_ for
> activities on the file, the life of an fd and the actual object that it
> points to is till the fd is closed. In the case of object handles though,
> the moment any handle is used to unlink the object (which BTW is done using
> the parent object handle and the name of the child), all handles pointing
> to the object are still valid pointers, but operations on then will result
> in ENOENT, as the actual object has since been unlinked and removed by the
> underlying filesystem.
>

Not always. If the file had hardlinks the handle should still be valid. And
if there were no hardlinks and you unlinked the last link, further
operations must return ESTALE. ENOENT is when a "basename" does not resolve
to a handle (in entry operations) - for e.g when you try to unlink the same
entry a second time. Whereas ESTALE is when a presented handle does not
exist - for e.g when you try to operate (read, chmod) a handle which got
deleted.

> The departure from fds is considered valid in my perspective, as the
> handle points to an object, which has since been removed, and so there is
> no semantics here that needs it to be preserved for further operations as
> there is a reference to it held.
>

The departure is only in the behavior of unlinked files. That is orthogonal
to whether you want to return separate handles each time a component is
looked up. I fail to see how the "departure from fd behavior" justifies
creating new glfs_object per lookup?

> So in essence for each time an object handle is returned by the API, it
> has to be closed for its life to end. Additionally if the object that it
> points to is removed from the underlying system, the handle is pointing to
> an entry that does not exist any longer and returns ENOENT on operations
> using the same.
>
> 2) The issue/benefit of having the same object handle irrespective of
> looking it up multiple times
>
> If we have an 1-1 relationship of object handles (i.e struct glfs_object)
> to inodes, then the caller gets the same pointer to the handle. Hence
> having multiple handles as per the caller, boils down to giving out ref
> counted glfs_object(s) for the same inode.
>
> Other than the memory footprint, this will still not make the object live
> past it's unlink time. The pointer handed out will be still valid till the
> last ref count is removed (i.e the object handle closed), at which point
> the object handle can be destroyed.
>

If I understand what you say above correctly, you intend to solve the
problem of "unlinked files must return error" at your API layer? That's
wrong. The right way is to ref-count glfs_object and return them precisely
because you should NOT make the decision about the end of life of an inode
at that layer. A hardlink may have been created by another client and the
glfs_object may therefore be still be valid.

You are also returning separate glfs_object for different hardlinks of a
file. Does that mean glfs_object is representing a dentry? or a
per-operation reference to an inode?

So again, as many handles were handed out for the same inode, they have to
> be closed, etc.
>
> 3) Graph switches
>
> In the case of graph switches, handles that are used in operations post
> the switch, get refreshed with an inode from the new graph, if we have an
> N:1 object to inode relationship.
>
> In the case of 1:1 this is done once, but is there some multi thread
> safety that needs to be in place? I think this is already in place from the
> glfs_resolve_inode implementation as suggested earlier, but good to check.
>
> 4) Renames
>
> In the case of renames, the inode remains the same, hence all handed out
> object handles still are valid and will operate on the right object per se.
>
> 5) unlinks and recreation of the same _named_ object in the background
>
> Example being, application gets an handle for an object, say named
> "a.txt", and in the background (or via another application/client) this is
> deleted and recreated.
>
> This will return ENOENT as the GFID would have changed for the previously
> held object to the new one, even though the names are the same. This seems
> like the right behaviour, and does not change in the case of a 1:1 of an
> N:1 object handle to inode mapping.
>
> So bottom line, I see the object handles like an fd with the noted
> difference above. Having them in a 1:1 relationship or as a N:1
> relationship does not seem to be an issue from what I understand, what am I
> missing here?
>

The issue is this. From what I understand, the usage of glfs_object in the
FSAL is not like a per-operation handle, but something stored long term
(many minutes, hours, days) in the per-inode context of the NFS Ganesha
layer. Now NFS Ganesha may be doing the "right thing" by not re-looking up
an already looked up name and therefore avoiding a leak (I'm not so sure,
it still needs to verify every so often if the mapping is still valid).
>From NFS Ganesha's point of view the handle is changing on every lookup.

Now consider what happens in case of READDIRPLUS. A list of names and
handles are returned to the client. The list of names can possibly include
names which were previously looked up as well. Both are supposed to
represent the same "gfid", but here will be returning new glfs_objects.
When a client performs an operation on a GFID, on which glfs_object will
the operation be performed at the gfapi layer? This part seems very
ambiguous and not clear.

What would really help is if you can tell what a glfs_object is supposed to
represent? - an on disk inode (i.e GFID)? an in memory per-graph inode (i.e
inode_t)? A dentry? A per-operation handle to an on disk inode? A
per-operation handle to an in memory per-graph inode? A per operation
handle to a dentry? In the current form, it does not seem to fit any of the
these categories.

Avati

Shyam
>
>
> ------------------------------
>
> *From: *"Anand Avati" <avati at gluster.org>
> *To: *"Shyamsundar Ranganathan" <srangana at redhat.com>
> *Cc: *"Gluster Devel" <gluster-devel at nongnu.org>
> *Sent: *Monday, September 30, 2013 10:35:05 AM
>
> *Subject: *Re: RFC/Review: libgfapi object handle based extensions
>
> I see a pretty core issue - lifecycle management of 'struct glfs_object'.
> What is the structure representing? When is it created? When is it
> destroyed? How does it relate to inode_t?
>
> Looks like for every lookup() we are creating a new glfs_object, even if
> the looked up inode was already looked up before (in the cache) and had a
> glfs_object created for it in the recent past.
>
> We need a stronger relationship between the two with a clearer
> relationship. It is probably necessary for a glfs_object to represent
> mulitple inode_t's at different points in time depending on graph switches,
> but for a given inode_t we need only one glfs_object. We definitely must
> NOT have a new glfs_object per lookup call.
>
> Avati
>
> On Thu, Sep 19, 2013 at 5:13 AM, Shyamsundar Ranganathan <
> srangana at redhat.com> wrote:
>
>> Avati,
>>
>> Please find the updated patch set for review at gerrit.
>> http://review.gluster.org/#/c/5936/
>>
>> Changes made to address the points (1) (2) and (3) below. By the usage of
>> the suggested glfs_resolve_inode approach.
>>
>> I have not yet changes glfs_h_unlink to use the glfs_resolve_at. (more on
>> this a little later).
>>
>> So currently, the review request is for all APIs other than,
>> glfs_h_unlink, glfs_h_extract_gfid, glfs_h_create_from_gfid
>>
>> glfs_resolve_at: Using this function the terminal name will be a force
>> look up anyway (as force_lookup will be passed as 1 based on
>> !next_component). We need to avoid this _extra_ lookup in the unlink case,
>> which is why all the inode_grep(s) etc. were added to the glfs_h_lookup in
>> the first place.
>>
>> Having said the above, we should still leverage glfs_resolve_at anyway,
>> as there seem to be other corner cases where the resolved inode and subvol
>> maybe from different graphs. So I think I want to modify glfs_resolve_at to
>> make a conditional force_lookup, based on iatt being NULL or not. IOW,
>> change the call to glfs_resolve_component with the conditional as, (reval
>> || (!next_component && iatt)). So that callers that do not want the iatt
>> filled, can skip the syncop_lookup.
>>
>> Request comments on the glfs_resolve_at proposal.
>>
>> Shyam.
>>
>> ----- Original Message -----
>>
>> > From: "Anand Avati" <avati at gluster.org>
>> > To: "Shyamsundar Ranganathan" <srangana at redhat.com>
>> > Cc: "Gluster Devel" <gluster-devel at nongnu.org>
>> > Sent: Wednesday, September 18, 2013 11:39:27 AM
>> > Subject: Re: RFC/Review: libgfapi object handle based extensions
>>
>> > Minor comments are made in gerrit. Here is a larger (more important)
>> comment
>> > for which email is probably more convenient.
>>
>> > There is a problem in the general pattern of the fops, for example
>> > glfs_h_setattrs() (and others too)
>>
>> > 1. glfs_validate_inode() has the assumption that object->inode deref is
>> a
>> > guarded operation, but here we are doing an unguarded deref in the
>> paramter
>> > glfs_resolve_base().
>>
>> > 2. A more important issue, glfs_active_subvol() and
>> glfs_validate_inode() are
>> > not atomic. glfs_active_subvol() can return an xlator from one graph,
>> but by
>> > the time glfs_validate_inode() is called, a graph switch could have
>> happened
>> > and inode can get resolved to a different graph. And in syncop_XXXXXX()
>> we
>> > end up calling on graph1 with inode belonging to graph2.
>>
>> > 3. ESTALE_RETRY is a fundamentally wrong thing to do with handle based
>> > operations. The ESTALE_RETRY macro exists for path based FOPs where the
>> > resolved handle could have turned stale by the time we perform the FOP
>> > (where resolution and FOP are non-atomic). Over here, the handle is
>> > predetermined, and it does not make sense to retry on ESTALE (notice
>> that FD
>> > based fops in glfs-fops.c also do not have ESTALE_RETRY for this same
>> > reason)
>>
>> > I think the pattern should be similar to FD based fops which
>> specifically
>> > address both the above problems. Here's an outline:
>>
>> > glfs_h_XXXX(struct glfs *fs, glfs_object *object, ...)
>> > {
>> > xlator_t *subvol = NULL;
>> > inode_t *inode = NULL;
>>
>> > __glfs_entry_fs (fs);
>>
>> > subvol = glfs_active_subvol (fs);
>> > if (!subvol) { errno = EIO; ... goto out; }
>>
>> > inode = glfs_resolve_inode (fs, object, subvol);
>> > if (!inode) { errno = ESTALE; ... goto out; }
>>
>> > loc.inode = inode;
>> > ret = syncop_XXXX(subvol, &loc, ...);
>>
>> > }
>>
>> > Notice the signature of glfs_resolve_inode(). What it does: given a
>> > glfs_object, and a subvol, it returns an inode_t which is resolved on
>> that
>> > subvol. This way the syncop_XXX() is performed with matching subvol and
>> > inode. Also it returns the inode pointer so that no unsafe object->inode
>> > deref is done by the caller. Again, this is the same pattern followed
>> by the
>> > fd based fops already.
>>
>> > Also, as mentioned in one of the comments, please consider using
>> > glfs_resolve_at() and avoiding manual construction of loc_t.
>>
>> > Thanks,
>> > Avati
>>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://supercolony.gluster.org/pipermail/gluster-devel/attachments/20130930/67d71684/attachment-0001.html>