[Gluster-devel] Fwd: Re: RFC: Using anonymous fds in quick-read

Wed Sep 5 02:44:03 UTC 2012

On Mon, Sep 3, 2012 at 6:10 AM, Raghavendra G <raghavendra.hg at gmail.com> wrote:

>
>>> Following are the questions/thoughts related to anonymous fd framework
>>> and their usage in quick read. Please answer or give your feedback.
>>>
>>> Questions related to anonymous fd framework:
>>> ============================================
>>> * Anonymous fds can work because open in itself doesn't do any primary
>>> task the application is interested in, like read or write (the application
>>> does an open with the intent of doing something else). This raises the
>>> question: why do we need open at all, can't we eliminate it altogether? If
>>> we were to eliminate open, aren't we moving from a neater to a messier
>>> design, where each fop has to check on every invocation whether the work
>>> associated with open (like storing contexts) has been done?
>>>
>>
>> Some corrections to the above statement. There are two parts to the
>> open() call
>>
>> 1) The effects of the call itself. Like
>> a) Perform permission checks and establish a 'session' (with the fd) on
>> the allowed permission [even if permission of the inode changes in the
>> future while the fd is still open]
>> b) Perform additional operation like file truncation when flag O_TRUNC
>> is specified
>>
>> 2) Side effects of the call, like
>> a) Specify the cache effects on future syscalls with O_[RD]SYNC,
>> O_DIRECT flags
>> b) Offer immunity against future calls like rename() and unlink()
>>
>> These are the kind of things even Gluster (or any other FS) has to
>> guarantee with its open() syscall.
>>
>> Anonymous fds exist because
>> a) Protocols like NFS3 do not support the above semantics and they are
>> implemented completely in the client side. But we require an fd_t
>> parameter in the read/write fops which also do not require some of the
>> above semantics (like read/write perm checks) and other semantics are
>> guaranteed by anonymous fds already (like immunity against rename()).
>> Note that immunity against unlink() is currently not existing in
>> anonymous fds.
>>
>> b) Internal optimizations in perf xlators do not require all the above
>> semantics sometimes.
>>
>> Whether we use anonymous FDs or not, we need to keep up all the above
>> semantics. There are some issues with the semantics even in today's
>> version of quick-read - we assume permission check has already happened
>> (which is usually true as FUSE performs permission checks) - but that
>> may not be the case always. That apart, the benefit of anonymous fds in
>> quick-read can be in handling of fd based fops in the window of time
>> between a short-cut'ed open() and its completion from the backend. They
>> need not wait for the open() completion if they arrive early. Instead
>> they can proceed with an anonymous fd -- which can significantly reduce
>> code complexity.
>
>
>
>
>> Again, this can be limited to O_RDONLY +
>> ~(O_DIRECT|O_TRUNC) flag'ed open()s
>
>
> Why is this restriction? Can you elaborate on that?
>

Why, isn't it obvious? If there is an open() with O_TRUNC flag, then we
have to make sure the file is truncated before we complete the open()
system call. Similarly if there is O_DIRECT along with O_RDWR or O_WRONLY,
then we need to make sure all layers understand in case a write arrives
before the actual open() completes from the backend.
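
The restriction above can be sketched as a small predicate. This is only an
illustration in Python using the flag constants from the os module; the name
can_shortcut_open is hypothetical and is not part of quick-read or any
GlusterFS API.

```python
import os

# O_DIRECT is not defined on every platform; treat it as 0 where missing.
O_DIRECT = getattr(os, "O_DIRECT", 0)
# Mask covering the access-mode bits of the open() flags.
ACCESS_MODE_MASK = os.O_RDONLY | os.O_WRONLY | os.O_RDWR


def can_shortcut_open(flags):
    """Hypothetical check: may this open() be answered before the backend?"""
    read_only = (flags & ACCESS_MODE_MASK) == os.O_RDONLY
    # O_TRUNC must take effect before open() returns, and O_DIRECT changes
    # how later I/O on the fd is handled, so neither may be short-cut.
    has_side_effects = bool(flags & (O_DIRECT | os.O_TRUNC))
    return read_only and not has_side_effects
```

Under this sketch a plain O_RDONLY open qualifies, while any open carrying
O_TRUNC, O_DIRECT or a write access mode has to wait for the backend.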

>
>
>> and thereby only be vulnerable to
>> unlink()s happening in that window.
>>
>
> Irrespective of anonymous fds, quick read would be vulnerable to unlinks
> in the window bounded by open returning in application and open actually
> happening in backend. I am not seeing how anonymous fds alter this
> situation. Can you please explain?
>

Correct, that's exactly what I said - irrespective of anonymous FDs we are
vulnerable to unlinks hitting the race between the short-cut'ed open() and
read(). Anonymous FDs do not attempt to solve that problem. We are still
vulnerable to unlink()s happening in the window (but that is the only
vulnerability left).
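
To make the window concrete, here is a small sketch of the POSIX behaviour on
a local filesystem (plain Python, not GlusterFS code; the function name is
made up for illustration). An fd that was really opened before the unlink()
keeps working, while an open() that reaches the file only after the unlink()
fails, which is the position a short-cut'ed open() is in.

```python
import os
import tempfile


def demo_unlink_window():
    """Return (data read through a held fd, whether a late open failed)."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "f")
    with open(path, "wb") as f:
        f.write(b"hello")

    held_fd = os.open(path, os.O_RDONLY)  # real open completed before unlink
    os.unlink(path)
    try:
        data = os.read(held_fd, 16)  # still readable through the open fd
    finally:
        os.close(held_fd)

    try:
        late_fd = os.open(path, os.O_RDONLY)  # open arriving after unlink
        os.close(late_fd)
        late_open_failed = False
    except FileNotFoundError:
        late_open_failed = True

    os.rmdir(directory)
    return data, late_open_failed
```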

> Is this true even for fsync operation on backend filesystems? Does fsync
> flush changes across all fds opened on a file?
>

It would be very tricky (I imagine impossible) to implement an fsync() that
syncs only the operations issued through one FD and still offers ordering
guarantees! The very first sentence of man fsync states this pretty
explicitly -

"fsync() transfers ("flushes") *all modified in-core data of (i.e.,
modified buffer cache pages for) the file* referred to by the file
descriptor fd to the disk device (or other permanent storage device) where
that file resides."
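
A minimal sketch of the calling pattern in question, in plain Python against a
local file (the function name is made up for illustration): data is written
through one fd and fsync() is issued on a second fd of the same file. The
sketch cannot show durability, since that would need a crash to observe; it
only shows the sequence that the man page text says is covered.

```python
import os
import tempfile


def write_on_one_fd_fsync_on_another():
    """Write through fd_a, then fsync() through fd_b on the same file."""
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)
    fd_a = os.open(path, os.O_WRONLY)
    fd_b = os.open(path, os.O_RDWR)
    try:
        os.write(fd_a, b"dirty data")
        # Per the man page text quoted above, this flushes the file's
        # modified data, not just data written through fd_b.
        os.fsync(fd_b)
        return os.read(fd_b, 32)
    finally:
        os.close(fd_a)
        os.close(fd_b)
        os.unlink(path)
```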

>
>
>>  * Though we are trying to decouple path from addressing an inode in
>>> glusterfs using nameless lookups, that decoupling is not complete. There
>>> are translators which use naming patterns to assign priorities to files
>>> (like io-cache, quick-read for the purposes of deciding whether to flush a
>>> cache or not). To be honest, the problem is seen only in fd-migration where
>>> we are using nameless lookups - for fresh lookups - in new graph, after a
>>> graph switch. Currently I am using nameless lookups with loc.path set,
>>> which solves the problem. Ideally nameless lookups are not the ones to be
>>> used during migration, since they are not meant to be used for fresh
>>> lookups (at least till we get rid of dependencies on path based
>>> addressability internally in glusterfs). However, they have huge
>>> performance benefits.
>>>
>>
>> Not sure what the above point is w.r.t anonymous fds,
>
>
> Nothing related to anonymous fds themselves, but to their usage during fd
> migration after a graph switch. After a graph switch, the first lookup in the
> new graph is a fresh one, and translators like io-cache, quick-read and quota
> that make use of path information for their internal workings will be in
> trouble if we don't have the correct path in loc.path.
>
>
> but yes - nameless
>> lookup takes away the sense of hierarchy (and "filename") and operations
>> which depend on filename or hierarchy might not always work. But then
>> this has been true even before we brought in nameless lookups as FUSE
>> issues open() on an inode and therefore we are not guaranteed to perform
>> open() on the right path when you have hardlinks.
>>
>>  Using anonymous fd framework in quick-read:
>>> ===========================================
>>> * as far as quick read goes, its task becomes very simple. Just convert
>>> the fd to anonymous during open and return. It can eliminate all the
>>> dependencies of fops having to wait till open is actually done. In fact the
>>> fops it has to implement are: lookup, open and readv.
>>>
>>
>> Look at my previous comments; it must perform a few more checks.
>> quick-read cannot just "convert" an fd to anonymous fd. Anonymous fd has
>> fd->pid == -1 (which a quick-unwound open() fd will not). There are also
>> other semantics which need to be met (at least with best effort) while
>> the actual fd is still unopened.
>>
>>
>>> * Anonymous fd awareness should be brought into afr. It shouldn't try to
>>> open the files in fops like writev if the fd happens to be anonymous.
>>>
>>
>> I think that already is the case. Also, why do you specifically mention
>> afr?
>>
>
> I was thinking in terms of using anonymous fds in quick-read, without
> having to open the file explicitly at all by delegating that responsibility
> (of open) to servers.
>

I think this is a wrong expectation. I was thinking of using anonymous fds
only for the duration when the actual open() has not yet returned from the
server when the next operation arrives.
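
As a toy model of what I mean (plain Python, not GlusterFS code; the class and
method names are invented for illustration): a fop that arrives before the
backend open() reply is served through an anonymous fd, and once the reply is
in, every later fop uses the real fd.

```python
class ShortcutFd:
    """Toy model of an fd whose open() was unwound before the backend replied."""

    def __init__(self):
        # The application already holds the fd, but the backend open()
        # is still in flight.
        self.backend_open_done = False

    def backend_open_completed(self):
        # Called when the open() reply arrives from the server.
        self.backend_open_done = True

    def fd_kind_for_fop(self):
        # Early fops do not wait; they fall back to an anonymous fd.
        return "real" if self.backend_open_done else "anonymous"
```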

> Hence, I thought afr need not worry about opening the files. However, this
> may not work as you've explained earlier and I need to think over it.
>
>
OK.

Avati