[Gluster-devel] RFC/Review: libgfapi object handle based extensions

Anand Avati avati at gluster.org
Mon Sep 30 19:49:09 UTC 2013


On Mon, Sep 30, 2013 at 9:34 AM, Anand Avati <avati at gluster.org> wrote:

>
>
>
> On Mon, Sep 30, 2013 at 3:40 AM, Shyamsundar Ranganathan <
> srangana at redhat.com> wrote:
>
>> Avati, Amar,
>>
>> Amar, Anand S and myself had a discussion on this comment and here is an
>> answer to your queries the way I see it. Let me know if I am missing
>> something here.
>>
>> (this is not a NFS Ganesha requirement, FYI. As Ganesha will only do a
>> single lookup or preserve a single object handle per filesystem object in
>> its cache)
>>
>> Currently a glfs_object is an opaque pointer to an object (it is a
>> _handle_ to the object). The object itself contains a ref'd inode, which is
>> the actual pointer to the object.
>>
>> 1) The similarity and differences of object handles to fds
>>
>> The intention behind multiple object handles is in line with multiple fds
>> per file: an application using the library is free to look up (and/or
>> create) an object and acquire as many object handles for it as it wants,
>> and can hence determine the lifetime of each such handle in its own view.
>> So, in essence, one thread can hold an object handle to perform, say,
>> attribute-related operations, while another thread has the same object
>> looked up to perform I/O.
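The N:1 handle-to-inode scheme described here can be sketched as a toy model. This is purely illustrative: the struct names mirror gfapi, but none of it is the real implementation.

```c
/* Toy model of the N:1 handle-to-inode scheme: every lookup hands out a
 * fresh handle, each holding its own ref on the shared inode.
 * Illustrative only; not gfapi code. */
#include <assert.h>
#include <stdlib.h>

struct inode {          /* stands in for the real inode_t */
    int refcount;
};

struct glfs_object {    /* opaque handle: one per lookup */
    struct inode *inode;
};

/* every lookup creates a new handle holding its own inode ref */
static struct glfs_object *toy_lookup(struct inode *in)
{
    struct glfs_object *h = malloc(sizeof(*h));
    in->refcount++;
    h->inode = in;
    return h;
}

/* closing a handle drops exactly its own ref */
static void toy_close(struct glfs_object *h)
{
    h->inode->refcount--;
    free(h);
}
```

Under this model, two threads looking up the same object get distinct handles pointing at the same inode, and each handle's life ends only when it is individually closed.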
>>
>
> So do you mean a glfs_object is meant to be a *per-operation* handle? If
> one thread wants to perform a chmod() and another thread wants to perform
> chown() and both attempt to resolve the same name and end up getting
> different handles, then both of them unref the glfs_handle right after
> their operation?
>
>
>
>> Where object handles depart from the notion of fds is when an unlink is
>> performed. As POSIX defines that open fds remain _open_ for activities on
>> the file, an fd and the actual object it points to live until the fd is
>> closed. In the case of object handles though, the moment any handle is
>> used to unlink the object (which, BTW, is done using the parent object
>> handle and the name of the child), all handles pointing to the object are
>> still valid pointers, but operations on them will result in ENOENT, as
>> the actual object has since been unlinked and removed by the underlying
>> filesystem.
>>
>
> Not always. If the file had hardlinks, the handle should still be valid.
> And if there were no hardlinks and you unlinked the last link, further
> operations must return ESTALE. ENOENT is for when a "basename" does not
> resolve to a handle (in entry operations) - e.g. when you try to unlink
> the same entry a second time. Whereas ESTALE is for when a presented
> handle does not exist - e.g. when you try to operate on (read, chmod) a
> handle which got deleted.
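The ENOENT/ESTALE distinction can be captured in a small sketch. These are hypothetical helpers (`toy_lookup_entry`, `toy_fstat_handle`), not gfapi functions; the point is only which errno belongs to which kind of operation.

```c
/* Toy illustration of ENOENT (entry ops: name does not resolve) vs
 * ESTALE (handle ops: the presented handle's object is gone).
 * Hypothetical helpers; not gfapi code. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_inode {
    int nlink;          /* number of remaining hard links */
};

/* entry operation: name -> handle; a missing name is ENOENT */
static int toy_lookup_entry(const char *name, const char *existing,
                            struct toy_inode *in, struct toy_inode **out)
{
    if (existing == NULL || strcmp(name, existing) != 0)
        return -ENOENT;
    *out = in;
    return 0;
}

/* handle operation: a handle whose object was fully unlinked is ESTALE */
static int toy_fstat_handle(struct toy_inode *in)
{
    if (in->nlink == 0)
        return -ESTALE;
    return 0;
}
```

Note how a hardlink keeps the handle usable: only when the last link goes does the handle operation start failing with ESTALE.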
>
>
>> This departure from fds is valid in my perspective: the handle points to
>> an object that has since been removed, so there are no semantics here
>> that require the object to be preserved for further operations merely
>> because a reference to it is held.
>>
>
> The departure is only in the behavior of unlinked files. That is
> orthogonal to whether you want to return separate handles each time a
> component is looked up. I fail to see how the "departure from fd behavior"
> justifies creating new glfs_object per lookup?
>
>
>> So in essence, each time an object handle is returned by the API, it has
>> to be closed for its life to end. Additionally, if the object it points
>> to is removed from the underlying filesystem, the handle points to an
>> entry that no longer exists, and operations using it return ENOENT.
>>
>> 2) The issue/benefit of having the same object handle irrespective of
>> looking it up multiple times
>>
>> If we have a 1:1 relationship of object handles (i.e. struct glfs_object)
>> to inodes, then the caller gets the same handle pointer each time. Hence
>> multiple handles, from the caller's perspective, boil down to handing out
>> ref-counted glfs_object(s) for the same inode.
>>
>> Other than the memory footprint, this still will not make the object live
>> past its unlink time. The pointer handed out remains valid until the last
>> ref is dropped (i.e. the object handle is closed), at which point the
>> object handle can be destroyed.
>>
>
> If I understand what you say above correctly, you intend to solve the
> problem of "unlinked files must return an error" at your API layer? That's
> wrong. The right way is to ref-count glfs_objects and return them,
> precisely because you should NOT make the decision about the end of life
> of an inode at that layer. A hardlink may have been created by another
> client, and the glfs_object may therefore still be valid.
>
> You are also returning a separate glfs_object for different hardlinks of a
> file. Does that mean a glfs_object represents a dentry? Or a per-operation
> reference to an inode?
>
> So again, however many handles were handed out for the same inode, they
>> have to be closed, etc.
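The 1:1 ref-counted alternative under discussion (one glfs_object per inode, repeated lookups returning the same pointer with a bumped refcount) can be sketched as a toy model. Names are hypothetical; this is not the gfapi implementation.

```c
/* Toy sketch of the 1:1 alternative: at most one handle per inode,
 * lookups return the same pointer and take a ref. Illustrative only. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct inode1 {
    struct obj1 *handle;   /* back-pointer: at most one handle per inode */
};

struct obj1 {
    struct inode1 *inode;
    int refcount;
};

static struct obj1 *obj1_lookup(struct inode1 *in)
{
    if (in->handle) {           /* already handed out: just take a ref */
        in->handle->refcount++;
        return in->handle;
    }
    struct obj1 *h = calloc(1, sizeof(*h));
    h->inode = in;
    h->refcount = 1;
    in->handle = h;
    return h;
}

static void obj1_close(struct obj1 *h)
{
    if (--h->refcount == 0) {   /* last ref dropped: destroy the handle */
        h->inode->handle = NULL;
        free(h);
    }
}
```

As the thread notes, this changes only how many handle objects exist, not the unlink semantics: the handle stays allocated until every caller has closed it.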
>>
>> 3) Graph switches
>>
>> In the case of graph switches, handles that are used in operations after
>> the switch get refreshed with an inode from the new graph, if we have an
>> N:1 object-to-inode relationship.
>>
>> In the 1:1 case this is done once, but is there some multi-thread safety
>> that needs to be in place? I think this is already covered by the
>> glfs_resolve_inode implementation suggested earlier, but it would be good
>> to check.
>>
>> 4) Renames
>>
>> In the case of renames, the inode remains the same; hence all handed-out
>> object handles are still valid and will operate on the right object.
>>
>> 5) unlinks and recreation of the same _named_ object in the background
>>
>> Example: an application gets a handle for an object, say one named
>> "a.txt", and in the background (or via another application/client) it is
>> deleted and recreated.
>>
>> This will return ENOENT, as the GFID would have changed from the
>> previously held object to the new one, even though the names are the
>> same. This seems like the right behaviour, and it does not change whether
>> we have a 1:1 or an N:1 object-handle-to-inode mapping.
>>
>> So, bottom line, I see object handles as being like fds, with the noted
>> difference above. Having them in a 1:1 or an N:1 relationship does not
>> seem to be an issue from what I understand; what am I missing here?
>>
>
> The issue is this. From what I understand, the usage of glfs_object in the
> FSAL is not as a per-operation handle, but as something stored long term
> (many minutes, hours, days) in the per-inode context of the NFS Ganesha
> layer. Now NFS Ganesha may be doing the "right thing" by not re-looking up
> an already looked-up name and therefore avoiding a leak (I'm not so sure;
> it still needs to verify every so often whether the mapping is still
> valid). From NFS Ganesha's point of view, the handle is changing on every
> lookup.
>
> Now consider what happens in the case of READDIRPLUS. A list of names and
> handles is returned to the client. The list of names can possibly include
> names which were previously looked up as well. Both are supposed to
> represent the same "gfid", but here we will be returning new glfs_objects.
> When a client performs an operation on a GFID, on which glfs_object will
> the operation be performed at the gfapi layer? This part seems very
> ambiguous and unclear.
>
> What would really help is if you can tell what a glfs_object is supposed
> to represent - an on-disk inode (i.e. GFID)? An in-memory per-graph inode
> (i.e. inode_t)? A dentry? A per-operation handle to an on-disk inode? A
> per-operation handle to an in-memory per-graph inode? A per-operation
> handle to a dentry? In its current form, it does not seem to fit any of
> these categories.
>
>
> Avati
>


After giving this some more thought, I feel the cleanest way is to make
inode_t and the inode table graph-aware. This way, for a given GFID, there
will be one and only one inode_t at a given time, no matter how many graphs
are switched. It is also worth noting that the relationship between two
GFIDs does not change with a graph switch, so having a separate inode table
with duplicate inodes and dentries has always been redundant in a way. The
initial decision to have a separate inode table per graph was made because
the inode table was bound to an xlator_t (which in turn was bound to a
graph).

If we make inode_t and the inode table multi-graph aware, the same inode_t
would remain valid on a new graph. We would need new code to keep track of
the latest graph on which a given inode has been "initialized / discovered",
in order to force a discover() on the new graph if necessary (dentry
relations would just continue to be valid), and, after a graph switch, to
force cleanup of xlators from the old graph.
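The bookkeeping this proposal needs (track the latest graph an inode was discovered on, force a discover() once per new graph) can be sketched as follows. The field and function names here are hypothetical stand-ins, not the real inode_t layout.

```c
/* Toy sketch of a graph-aware inode: the inode remembers the last graph
 * it was initialized on, and a discover() is forced exactly once per new
 * graph. Hypothetical fields; not the real inode_t. */
#include <assert.h>

struct g_inode {
    int gfid;           /* stands in for the real 16-byte GFID */
    int init_graph;     /* latest graph this inode was discovered on */
    int discovers;      /* how many discover() calls were forced */
};

static void toy_discover(struct g_inode *in, int graph)
{
    in->discovers++;
    in->init_graph = graph;
}

/* before any fop, make sure the inode is valid on the active graph */
static void toy_refresh(struct g_inode *in, int active_graph)
{
    if (in->init_graph != active_graph)
        toy_discover(in, active_graph);
}
```

The key property is that the inode_t pointer itself never changes across a graph switch; only its per-graph initialization is refreshed lazily.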

This way, from your layer, glfs_object would be an alias to inode_t, and
internally you can just typedef it. This would also require some changes in
the resolver code (in both fuse and glfs-resolve) to handle graph switches
and fd migration in the new way, but I imagine the new way will be much
simpler and cleaner than the current approach anyway. It would also solve
the "handles returned in readdirplus" issue.

Another reason I prefer this new approach is that making inode_t graph
independent puts old-graph destruction completely "in our control", without
having to depend on (or force) FUSE to issue FORGET on inode_ts from the
old graph. That entire problem gets eliminated, as inode_ts would now be
graph independent.

(copying Raghavendra Bhat who is performing graph destruction work and Amar)

Thoughts?
Avati




>
>
> Shyam
>>
>>
>> ------------------------------
>>
>> *From: *"Anand Avati" <avati at gluster.org>
>> *To: *"Shyamsundar Ranganathan" <srangana at redhat.com>
>> *Cc: *"Gluster Devel" <gluster-devel at nongnu.org>
>> *Sent: *Monday, September 30, 2013 10:35:05 AM
>>
>> *Subject: *Re: RFC/Review: libgfapi object handle based extensions
>>
>> I see a pretty core issue - lifecycle management of 'struct glfs_object'.
>> What is the structure representing? When is it created? When is it
>> destroyed? How does it relate to inode_t?
>>
>> Looks like for every lookup() we are creating a new glfs_object, even if
>> the looked-up inode was already looked up before (i.e. is in the cache)
>> and had a glfs_object created for it in the recent past.
>>
>> We need a stronger, clearer relationship between the two. It is probably
>> necessary for a glfs_object to represent multiple inode_t's at different
>> points in time depending on graph switches, but for a given inode_t we
>> need only one glfs_object. We definitely must NOT have a new glfs_object
>> per lookup call.
>>
>> Avati
>>
>> On Thu, Sep 19, 2013 at 5:13 AM, Shyamsundar Ranganathan <
>> srangana at redhat.com> wrote:
>>
>>> Avati,
>>>
>>> Please find the updated patch set for review at gerrit.
>>> http://review.gluster.org/#/c/5936/
>>>
>>> Changes were made to address points (1), (2) and (3) below, by using
>>> the suggested glfs_resolve_inode approach.
>>>
>>> I have not yet changed glfs_h_unlink to use glfs_resolve_at (more on
>>> this a little later).
>>>
>>> So currently, the review request is for all APIs other than,
>>> glfs_h_unlink, glfs_h_extract_gfid, glfs_h_create_from_gfid
>>>
>>> glfs_resolve_at: Using this function, the terminal name will be force
>>> looked up anyway (as force_lookup will be passed as 1, based on
>>> !next_component). We need to avoid this _extra_ lookup in the unlink
>>> case, which is why all the inode_grep(s) etc. were added to
>>> glfs_h_lookup in the first place.
>>>
>>> Having said that, we should still leverage glfs_resolve_at anyway, as
>>> there seem to be other corner cases where the resolved inode and subvol
>>> may be from different graphs. So I want to modify glfs_resolve_at to do
>>> a conditional force_lookup, based on iatt being NULL or not. IOW, change
>>> the call to glfs_resolve_component with the conditional (reval ||
>>> (!next_component && iatt)), so that callers that do not want the iatt
>>> filled can skip the syncop_lookup.
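The proposed change reduces to a single predicate, shown here in isolation (the helper name is hypothetical; only the boolean expression comes from the proposal above):

```c
/* The proposed glfs_resolve_at change as a standalone predicate: force a
 * syncop_lookup on a component only when revalidation was requested, or
 * when it is the terminal component and the caller wants an iatt back.
 * Mirrors the proposed conditional (reval || (!next_component && iatt)). */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool toy_force_lookup(bool reval, bool is_terminal, const void *iatt)
{
    /* is_terminal stands in for !next_component */
    return reval || (is_terminal && iatt != NULL);
}
```

Under this rule, glfs_h_unlink can pass a NULL iatt for the terminal name and thereby skip the extra lookup the paragraph above is trying to avoid.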
>>>
>>> Request comments on the glfs_resolve_at proposal.
>>>
>>> Shyam.
>>>
>>> ----- Original Message -----
>>>
>>> > From: "Anand Avati" <avati at gluster.org>
>>> > To: "Shyamsundar Ranganathan" <srangana at redhat.com>
>>> > Cc: "Gluster Devel" <gluster-devel at nongnu.org>
>>> > Sent: Wednesday, September 18, 2013 11:39:27 AM
>>> > Subject: Re: RFC/Review: libgfapi object handle based extensions
>>>
>>> > Minor comments are made in gerrit. Here is a larger (more important)
>>> comment
>>> > for which email is probably more convenient.
>>>
>>> > There is a problem in the general pattern of the fops, for example
>>> > glfs_h_setattrs() (and others too)
>>>
>>> > 1. glfs_validate_inode() has the assumption that object->inode deref
>>> is a
>>> > guarded operation, but here we are doing an unguarded deref in the
>>> parameter to
>>> > glfs_resolve_base().
>>>
>>> > 2. A more important issue, glfs_active_subvol() and
>>> glfs_validate_inode() are
>>> > not atomic. glfs_active_subvol() can return an xlator from one graph,
>>> but by
>>> > the time glfs_validate_inode() is called, a graph switch could have
>>> happened
>>> > and inode can get resolved to a different graph. And in
>>> syncop_XXXXXX() we
>>> > end up calling on graph1 with inode belonging to graph2.
>>>
>>> > 3. ESTALE_RETRY is a fundamentally wrong thing to do with handle based
>>> > operations. The ESTALE_RETRY macro exists for path based FOPs where the
>>> > resolved handle could have turned stale by the time we perform the FOP
>>> > (where resolution and FOP are non-atomic). Over here, the handle is
>>> > predetermined, and it does not make sense to retry on ESTALE (notice
>>> that FD
>>> > based fops in glfs-fops.c also do not have ESTALE_RETRY for this same
>>> > reason)
>>>
>>> > I think the pattern should be similar to FD based fops which
>>> specifically
>>> > address both the above problems. Here's an outline:
>>>
>>> > glfs_h_XXXX (struct glfs *fs, struct glfs_object *object, ...)
>>> > {
>>> >         xlator_t *subvol = NULL;
>>> >         inode_t  *inode  = NULL;
>>> >
>>> >         __glfs_entry_fs (fs);
>>> >
>>> >         subvol = glfs_active_subvol (fs);
>>> >         if (!subvol) { errno = EIO; ... goto out; }
>>> >
>>> >         inode = glfs_resolve_inode (fs, object, subvol);
>>> >         if (!inode) { errno = ESTALE; ... goto out; }
>>> >
>>> >         loc.inode = inode;
>>> >         ret = syncop_XXXX (subvol, &loc, ...);
>>> > }
>>>
>>> > Notice the signature of glfs_resolve_inode(). What it does: given a
>>> > glfs_object, and a subvol, it returns an inode_t which is resolved on
>>> that
>>> > subvol. This way the syncop_XXX() is performed with matching subvol and
>>> > inode. Also it returns the inode pointer so that no unsafe
>>> object->inode
>>> > deref is done by the caller. Again, this is the same pattern followed
>>> by the
>>> > fd based fops already.
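What glfs_resolve_inode() is described to do here can be modeled with toy types. This is a hedged sketch of the described behavior, not the gfapi implementation: given a handle and the active subvol, return an inode resolved on that subvol's graph, refreshing it if the cached inode belongs to an older graph, and hand the caller its own ref so no raw object->inode deref is needed.

```c
/* Toy sketch of the described glfs_resolve_inode() contract: resolve the
 * handle's inode onto the active subvol's graph and return a ref'd inode.
 * Toy types; illustrative only. */
#include <assert.h>

struct toy_subvol  { int graph_id; };
struct toy_inode2  { int graph_id; int refs; };
struct toy_object  { struct toy_inode2 *inode; };

static struct toy_inode2 *toy_resolve_inode(struct toy_object *obj,
                                            struct toy_subvol *subvol)
{
    if (obj->inode->graph_id != subvol->graph_id) {
        /* cached inode is from an older graph: re-discover it on the
         * active graph (the toy just relabels it) */
        obj->inode->graph_id = subvol->graph_id;
    }
    obj->inode->refs++;     /* caller gets its own ref */
    return obj->inode;
}
```

Because the subvol is picked first and the inode is then resolved against it, the subsequent syncop_XXXX() always sees a matching subvol/inode pair, which is exactly the atomicity problem (2) above was about.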
>>>
>>> > Also, as mentioned in one of the comments, please consider using
>>> > glfs_resolve_at() and avoiding manual construction of loc_t.
>>>
>>> > Thanks,
>>> > Avati
>>>
>>
>>
>>
>

