[Gluster-devel] Inconsistent behavior due to lack of lookup on entry followed by readdirp

Krutika Dhananjay kdhananj at redhat.com
Thu Aug 13 11:56:42 UTC 2015


----- Original Message -----

> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> To: "Krutika Dhananjay" <kdhananj at redhat.com>
> Cc: "Mohammed Rafi K C" <rkavunga at redhat.com>, "Gluster Devel"
> <gluster-devel at gluster.org>, "Dan Lambright" <dlambrig at redhat.com>, "Nithya
> Balachandran" <nbalacha at redhat.com>, "Ben Turner" <bturner at redhat.com>, "Ben
> England" <bengland at redhat.com>, "Manoj Pillai" <mpillai at redhat.com>,
> "Pranith Kumar Karampuri" <pkarampu at redhat.com>, "Ravishankar
> Narayanankutty" <ranaraya at redhat.com>, xhernandez at datalab.es
> Sent: Thursday, August 13, 2015 3:51:43 PM
> Subject: Re: Inconsistent behavior due to lack of lookup on entry followed by
> readdirp

> ----- Original Message -----
> > From: "Krutika Dhananjay" <kdhananj at redhat.com>
> > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > Cc: "Mohammed Rafi K C" <rkavunga at redhat.com>, "Gluster Devel"
> > <gluster-devel at gluster.org>, "Dan Lambright"
> > <dlambrig at redhat.com>, "Nithya Balachandran" <nbalacha at redhat.com>, "Ben
> > Turner" <bturner at redhat.com>, "Ben England"
> > <bengland at redhat.com>, "Manoj Pillai" <mpillai at redhat.com>, "Pranith Kumar
> > Karampuri" <pkarampu at redhat.com>,
> > "Ravishankar Narayanankutty" <ranaraya at redhat.com>, xhernandez at datalab.es
> > Sent: Thursday, August 13, 2015 9:53:41 AM
> > Subject: Re: Inconsistent behavior due to lack of lookup on entry followed
> > by readdirp
> >
> > ----- Original Message -----
> >
> > > From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > To: "Krutika Dhananjay" <kdhananj at redhat.com>
> > > Cc: "Mohammed Rafi K C" <rkavunga at redhat.com>, "Gluster Devel"
> > > <gluster-devel at gluster.org>, "Dan Lambright" <dlambrig at redhat.com>,
> > > "Nithya
> > > Balachandran" <nbalacha at redhat.com>, "Ben Turner" <bturner at redhat.com>,
> > > "Ben
> > > England" <bengland at redhat.com>, "Manoj Pillai" <mpillai at redhat.com>,
> > > "Pranith Kumar Karampuri" <pkarampu at redhat.com>, "Ravishankar
> > > Narayanankutty" <ranaraya at redhat.com>, xhernandez at datalab.es
> > > Sent: Thursday, August 13, 2015 9:06:37 AM
> > > Subject: Re: Inconsistent behavior due to lack of lookup on entry
> > > followed
> > > by
> > > readdirp
> >
> > > ----- Original Message -----
> > > > From: "Krutika Dhananjay" <kdhananj at redhat.com>
> > > > To: "Mohammed Rafi K C" <rkavunga at redhat.com>
> > > > Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Dan Lambright"
> > > > <dlambrig at redhat.com>, "Nithya Balachandran"
> > > > <nbalacha at redhat.com>, "Raghavendra Gowdappa" <rgowdapp at redhat.com>,
> > > > "Ben
> > > > Turner" <bturner at redhat.com>, "Ben
> > > > England" <bengland at redhat.com>, "Manoj Pillai" <mpillai at redhat.com>,
> > > > "Pranith Kumar Karampuri"
> > > > <pkarampu at redhat.com>, "Ravishankar Narayanankutty"
> > > > <ranaraya at redhat.com>,
> > > > xhernandez at datalab.es
> > > > Sent: Wednesday, August 12, 2015 9:02:44 PM
> > > > Subject: Re: Inconsistent behavior due to lack of lookup on entry
> > > > followed
> > > > by readdirp
> > > >
> > > > I faced the same issue with the sharding translator. I fixed it by
> > > > making
> > > > its
> > > > readdirp callback initialize individual entries' inode ctx, some of
> > > > these
> > > > being xattr values, which are filled in entry->dict by the posix
> > > > translator.
> > > > Here is the patch that got merged recently:
> > > > http://review.gluster.org/11854
> > > > Would that be as easy to do in DHT as well?
> >
> > > The problem is not just filling out state in the inode. The bigger
> > > problem
> > > is
> > > healing, which is supposed to maintain a directory/file to be in state
> > > consistent with our design before a successful reply to lookup. The
> > > operations can involve creating directories on missing subvols, setting
> > > appropriate layout, etc. Effectively for readdirp to replace lookup, it
> > > should be calling dht_lookup on each of the dentry it is passing back to
> > > application.
> >
> > OK.
> >
> > > >
> > > > As far as AFR is concerned, it indirectly forces LOOKUP on entries
> > > > which
> > > > are
> > > > being retrieved for the first time through a READDIRP (and as a result
> > > > do
> > > > not have their inode ctx etc initialised yet) by setting entry->inode
> > > > to
> > > > NULL. See afr_readdir_transform_entries().
> >
> > > Hmm. Then we already have "disabled" readdirp through code :). Without an
> > > inode corresponding to entry, readdirp will be effectively readdir
> > > stripping
> > > any performance benefits by having readdirp as a "batched" lookup (of all
> > > the dentries).
> > No. Not every single READDIRP will be transformed into a READDIR by AFR.
> > AFR
> > resets the inode corresponding to an entry, before responding to its
> > parent,
> > _only_ under the following two conditions:
> > 1) if this entry in question is being retrieved by this client for the
> > first
> > time through a READDIRP. In other words, this client has not _yet_
> > performed
> > a LOOKUP on it.
> > 2) if that sub-volume of AFR on which the parent directory is being
> > READDIRP'd (remember AFR would only need to serve inode and directory reads
> > from one of the replicas) does _not_ contain a good copy of the entry.
> > In other words this entry needs to be healed on parent's read child. This
> > is
> > because we do not want the caching translators or the application itself to
> > get incorrect entry attributes.

> Thanks Krutika. We'll be borrowing the idea of setting entry->inode to NULL,
> when dht determines that inode needs to be healed. Since afr is already
> doing that for all dentries during first READDIRP (barring any lookups on
> that inode before), I don't think doing this will have any further
> performance degradation (As most of the setups will be
> distributed-replicated).
Yep. As long as we don't have http://review.gluster.org/#/c/11846/ , AFR will mask the behavior that you would be introducing in DHT. 
However, once Pranith is back, he should be able to give us good reasons for merging this patch. 

-Krutika 

> >
> > This means that more often than not, AFR _would_ be leaving the inode
> > corresponding to the entry as it is, and not setting it to NULL.
> >
> > > > This is the default behavior which is being made optional as part of
> > > > http://review.gluster.org/#/c/11846/ which is still under review (see
> > > > BZ
> > > > 1250803, a performance bug :) ).
> >
> > > If it is made optional, when we enable setting entry->inode we still see
> > > consistency issues. Also, it seems to me that there is no point in having
> > > each individual xlator option controlling this behaviour. Instead we can
> > > make each xlator behave in compliance to global mount option
> > > "--use-readdirp=yes/no". Is there any specific reason to have an option
> > > to
> > > control this behaviour in afr?
> > Agreed that there will be consistency issues.
> > The reason to move away from letting 1) and 2) above be the default
> > behavior
> > is performance. :) And I guess it is also partly because AFR-v1 does not
> > have these restrictions in READDIRP. But the patch is still under review.
> >
> > -Krutika
> >
> > > >
> > > > -Krutika
> > > >
> > > > ----- Original Message -----
> > > >
> > > > > From: "Mohammed Rafi K C" <rkavunga at redhat.com>
> > > > > To: "Gluster Devel" <gluster-devel at gluster.org>
> > > > > Cc: "Dan Lambright" <dlambrig at redhat.com>, "Nithya Balachandran"
> > > > > <nbalacha at redhat.com>, "Raghavendra Gowdappa" <rgowdapp at redhat.com>,
> > > > > "Ben
> > > > > Turner" <bturner at redhat.com>, "Ben England" <bengland at redhat.com>,
> > > > > "Manoj
> > > > > Pillai" <mpillai at redhat.com>, "Pranith Kumar Karampuri"
> > > > > <pkarampu at redhat.com>, "Ravishankar Narayanankutty"
> > > > > <ranaraya at redhat.com>,
> > > > > kdhananj at redhat.com, xhernandez at datalab.es
> > > > > Sent: Wednesday, August 12, 2015 7:29:48 PM
> > > > > Subject: Inconsistent behavior due to lack of lookup on entry
> > > > > followed
> > > > > by
> > > > > readdirp
> > > >
> > > > > Hi All,
> > > >
> > > > > We are facing some inconsistent behavior for fops like rename, unlink
> > > > > etc due to lack of lookup followed by a readdirp, more specifically
> > > > > if
> > > > > inodes/gfid are populated via readdirp call and this nodeid is shared
> > > > > with kernal, md-cache will cache this based on base-name. Then
> > > > > subsequent named lookup will be served from md-cache and it
> > > > > winds-back
> > > > > immediately. So there is a chance to have an FOP triggered with out
> > > > > having a lookup on an entry. DHT does lot of things like creating
> > > > > link
> > > > > files and populate inode_ctx etc, during lookup. In such scenario it
> > > > > is
> > > > > must to have at least one lookup to be happened on an entry. Since
> > > > > readdirp preventing the lookup, it has been very hard for fops to
> > > > > proceed without a first lookup on the entry. We are also suspecting
> > > > > some
> > > > > problems due to same with afr/ec self healing also. So If we remove
> > > > > readdirp from md-cache ([1], [2]) it causes, an additional hop for
> > > > > first
> > > > > lookup for every entry. I'm mostly concerned with this one extra
> > > > > network
> > > > > call, and the performance degradation caused by the same.
> > > >
> > > > > Now with this, the only advantage with readdirp is, it removes one
> > > > > context switch between kernal and userspace. Is it really worth to
> > > > > sacrifice this for consistency ?
> > > >
> > > > > What do you think about removing readdirp functionality?
> > > >
> > > > > Please provide your input/suggestion/ideas.
> > > >
> > > > > [1] : http://review.gluster.org/#/c/11892/
> > > >
> > > > > [2] : http://review.gluster.org/#/c/11894/
> > > >
> > > > > Thanks in Advance
> > > > > Rafi KC
> > > >
> >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-devel/attachments/20150813/44f00f31/attachment-0001.html>


More information about the Gluster-devel mailing list