[Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

Mon Jun 6 09:16:46 UTC 2016

Hi Raghavendra,

On 06/06/16 10:54, Raghavendra G wrote:
>
>
> On Wed, Jun 1, 2016 at 12:50 PM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>
>     Hi,
>
>     On 01/06/16 08:53, Raghavendra Gowdappa wrote:
>
>
>
>         ----- Original Message -----
>
>             From: "Xavier Hernandez" <xhernandez at datalab.es>
>             To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>,
>             "Raghavendra G" <raghavendra at gluster.com>
>             Cc: "Gluster Devel" <gluster-devel at gluster.org>
>             Sent: Wednesday, June 1, 2016 11:57:12 AM
>             Subject: Re: [Gluster-devel] dht mkdir preop check, afr and
>             (non-)readable afr subvols
>
>             Oops, you are right. For entry operations the current
>             version of the
>             parent directory is not checked, just to avoid this problem.
>
>             This means that mkdir will be sent to all alive subvolumes.
>             However it
>             still selects the group of answers that have a minimum
>             quorum equal or
>             greater than #bricks - redundancy. So it should be still valid.
>
>
>         What if the quorum is met on "bad" subvolumes? and mkdir was
>         successful on bad subvolumes? Do we consider mkdir as
>         successful? If yes, even EC suffers from the problem described
>         in bz https://bugzilla.redhat.com/show_bug.cgi?id=1341429.
>
>
>     I don't understand the real problem. How could a subvolume of EC be
>     in a bad state from the point of view of DHT?
>
>     If you use xattrs to configure something in the parent directories,
>     you should have needed to use setxattr or xattrop to do that. These
>     operations do consider good/bad bricks because they touch inode
>     metadata. This will only succeed if enough (quorum) bricks have
>     successfully processed it. If quorum is met but for an error answer,
>     an error will be reported to DHT and the majority of bricks will be
>     left in the old state (these should be considered the good
>     subvolumes). If some brick has succeeded, it will be considered bad
>     and will be healed. If no quorum is met (even for an error answer),
>     EIO will be returned and the state of the directory should be
>     considered unknown/damaged.
>
>
> Yes. Ideally, dht should use a getxattr for the layout xattr. But, for
> performance reasons we thought of overloading mkdir by introducing
> pre-operations (done by bricks). With plain dht it is a simple
> comparison of xattrs passed as argument and xattrs stored on disk. But,
> I failed to include afr and EC in the picture.

I'm still missing something. Looking at the patch that implements this
(http://review.gluster.org/13885), it seems that mkdir fails if the
parent xattr is not correctly set, so it's not possible to create a
directory on a "bad" brick.
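
As a rough illustration of that reading of the patch, here is a minimal
sketch in Python (not the actual storage/posix code). It models the pre-op
check as a plain comparison between the xattrs passed with the request and
the xattrs stored on the brick. The function name, the dict/set
representation, and the use of ESTALE for a failed check are all
assumptions made for illustration.

```python
import errno


def brick_mkdir(on_disk_xattrs, expected_xattrs, entries, name):
    """Create `name` only if the parent's stored xattrs match the request.

    Returns (ret, err) in the style of a fop callback: (0, 0) on success,
    (-1, errno) on failure. ESTALE for a failed pre-op check is an
    assumption, not taken from the patch.
    """
    for key, expected in expected_xattrs.items():
        if on_disk_xattrs.get(key) != expected:
            # Pre-op check failed: the brick refuses to create the entry.
            return -1, errno.ESTALE
    if name in entries:
        return -1, errno.EEXIST
    entries.add(name)
    return 0, 0
```

Under this model a brick whose parent xattr is stale rejects the mkdir
before touching the directory, so no entry is ever created there.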

If the majority of the subvolumes of ec fail, the whole request will
fail and this failure will be reported to DHT. If the majority succeed,
the success will be reported to DHT, even if some of the subvolumes have
failed.
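
The combination logic described earlier in this thread can be sketched
roughly as follows. This is a simplified Python model, not the ec code:
answers are reduced to (ret, errno) pairs, while the real comparison also
includes xdata contents. The function name and data layout are
hypothetical.

```python
from collections import Counter


def combine_answers(answers, bricks, redundancy):
    """Pick the answer to report, given one (ret, err) pair per brick.

    `answers` maps a brick index to its (ret, err) reply. Returns
    (answer, bad_bricks). `answer` is None when no group reaches the
    minimum size of bricks - redundancy, i.e. the case where EIO is
    returned and the state is considered unknown.
    """
    if not answers:
        return None, []
    groups = Counter(answers.values())
    best, size = groups.most_common(1)[0]
    if size < bricks - redundancy:
        # No group is large enough: no brick can be singled out as bad.
        return None, []
    # Bricks that answered differently from the winning group get marked
    # so that self-heal can repair them.
    bad = sorted(b for b, a in answers.items() if a != best)
    return best, bad
```

For example, with 6 bricks and redundancy 2, five successes and one
failure give a successful answer with one brick marked bad. Four identical
failures and two successes give the failure as the answer, and the two
bricks that succeeded are the ones marked bad.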

Maybe if you give me a specific example I may see the real problem.

Xavi

> Hence this issue. How
> difficult is it for EC and AFR to bring in this kind of check? Is it even
> possible for afr and EC to implement this kind of pre-op checks with
> reasonable complexity?
>
>
>     If a later mkdir checks this value in storage/posix and succeeds in
>     enough bricks, it necessarily means that it has succeeded in good
>     bricks, because there cannot be enough bricks with the bad xattr value.
>
>     Note that quorum is always > #bricks/2 so we cannot have a quorum
>     with good and bad bricks at the same time.
>
>     Xavi
>
>
>
>
>             Xavi
>
>             On 01/06/16 06:51, Pranith Kumar Karampuri wrote:
>
>                 Xavi,
>                         But if we keep winding only to good subvolumes,
>                 there is a case
>                 where bad subvolumes will never catch up right? i.e. if
>                 we keep creating
>                 files in same directory and everytime self-heal
>                 completes there are more
>                 entries mounts would have created on the good subvolumes
>                 alone. I think
>                 I must have missed this in the reviews if this is the
>                 current behavior.
>                 It was not in the earlier releases. Right?
>
>                 Pranith
>
>                 On Tue, May 31, 2016 at 2:17 PM, Raghavendra G
>                 <raghavendra at gluster.com> wrote:
>
>
>
>                     On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez
>                     <xhernandez at datalab.es> wrote:
>
>                         Hi,
>
>                         On 31/05/16 07:05, Raghavendra Gowdappa wrote:
>
>                             +gluster-devel, +Xavi
>
>                             Hi all,
>
>                             The context is [1], where bricks do
>                 pre-operation checks
>                             before doing a fop and proceed with fop only
>                 if pre-op check
>                             is successful.
>
>                             @Xavi,
>
>                             We need your inputs on behavior of EC
>                 subvolumes as well.
>
>
>                         If I understand correctly, EC shouldn't have any
>                 problems here.
>
>                         EC sends the mkdir request to all subvolumes
>                 that are currently
>                         considered "good" and tries to combine the
>                 answers. Answers that
>                         match in return code, errno (if necessary) and
>                 xdata contents
>                         (except for some special xattrs that are ignored
>                 for combination
>                         purposes), are grouped.
>
>                         Then it takes the group with more
>                 members/answers. If that group
>                         has a minimum size of #bricks - redundancy, it
>                 is considered the
>                         good answer. Otherwise EIO is returned because
>                 bricks are in an
>                         inconsistent state.
>
>                         If there's any answer in another group, it's
>                 considered bad and
>                         gets marked so that self-heal will repair it
>                 using the good
>                         information from the majority of bricks.
>
>                         xdata is combined and returned even if return
>                 code is -1.
>
>                         Is that enough to cover the needed behavior ?
>
>
>                     Thanks Xavi. That's sufficient for the feature in
>                 question. One of
>                     the main cases I was interested in was what would be
>                 the behaviour
>                     if mkdir succeeds on "bad" subvolume and fails on
>                 "good" subvolume.
>                     Since you never wind mkdir to "bad" subvolume(s),
>                 this situation
>                     never arises.
>
>
>
>
>                         Xavi
>
>
>
>                             [1] http://review.gluster.org/13885
>
>                             regards,
>                             Raghavendra
>
>                             ----- Original Message -----
>
>                                 From: "Pranith Kumar Karampuri"
>                 <pkarampu at redhat.com <mailto:pkarampu at redhat.com>
>                                 <mailto:pkarampu at redhat.com
>                 <mailto:pkarampu at redhat.com>>>
>                                 To: "Raghavendra Gowdappa"
>                 <rgowdapp at redhat.com <mailto:rgowdapp at redhat.com>
>                                 <mailto:rgowdapp at redhat.com
>                 <mailto:rgowdapp at redhat.com>>>
>                                 Cc: "team-quine-afr"
>                 <team-quine-afr at redhat.com
>                 <mailto:team-quine-afr at redhat.com>
>                                 <mailto:team-quine-afr at redhat.com
>                 <mailto:team-quine-afr at redhat.com>>>, "rhs-zteam"
>                                 <rhs-zteam at redhat.com
>                 <mailto:rhs-zteam at redhat.com>
>                 <mailto:rhs-zteam at redhat.com <mailto:rhs-zteam at redhat.com>>>
>                                 Sent: Tuesday, May 31, 2016 10:22:49 AM
>                                 Subject: Re: dht mkdir preop check, afr and
>                                 (non-)readable afr subvols
>
>                                 I think you should start a discussion on
>                 gluster-devel
>                                 so that Xavi gets a
>                                 chance to respond on the mails as well.
>
>                                 On Tue, May 31, 2016 at 10:21 AM,
>                 Raghavendra Gowdappa
>                                 <rgowdapp at redhat.com
>                 <mailto:rgowdapp at redhat.com> <mailto:rgowdapp at redhat.com
>                 <mailto:rgowdapp at redhat.com>>>
>                                 wrote:
>
>                                     Also note that we have plans to extend
>                                     this pre-op check to all dentry
>                                     operations which also depend on the
>                                     parent layout. So, the discussion
>                                     needs to cover all dentry operations
>                                     like:
>
>                                     1. create
>                                     2. mkdir
>                                     3. rmdir
>                                     4. mknod
>                                     5. symlink
>                                     6. unlink
>                                     7. rename
>
>                                     We also plan to have similar checks
>                 in lock codepath
>                                     for directories too
>                                     (planning to use hashed-subvolume as
>                 lock-subvolume
>                                     for directories). So,
>                                     more fops :)
>                                     8. lk (posix locks)
>                                     9. inodelk
>                                     10. entrylk
>
>                                     regards,
>                                     Raghavendra
>
>                                     ----- Original Message -----
>
>                                         From: "Raghavendra Gowdappa"
>                                         <rgowdapp at redhat.com
>                 <mailto:rgowdapp at redhat.com> <mailto:rgowdapp at redhat.com
>                 <mailto:rgowdapp at redhat.com>>>
>                                         To: "team-quine-afr"
>                 <team-quine-afr at redhat.com
>                 <mailto:team-quine-afr at redhat.com>
>
>                 <mailto:team-quine-afr at redhat.com
>                 <mailto:team-quine-afr at redhat.com>>>
>                                         Cc: "rhs-zteam"
>                 <rhs-zteam at redhat.com <mailto:rhs-zteam at redhat.com>
>                                         <mailto:rhs-zteam at redhat.com
>                 <mailto:rhs-zteam at redhat.com>>>
>                                         Sent: Tuesday, May 31, 2016
>                 10:15:04 AM
>                                         Subject: dht mkdir preop check,
>                 afr and
>                                         (non-)readable afr subvols
>
>                                         Hi all,
>
>                                         I have some queries related to
>                 the behavior of
>                                         afr_mkdir with respect to
>                                         readable subvols.
>
>                                         1. While winding mkdir to
>                 subvols does afr check
>                                         whether the subvolume is
>                                         good/readable? Or does it wind
>                 to all subvols
>                                         irrespective of whether a
>                                         subvol is good/bad? In the
>                 latter case, what if
>                                            a. mkdir succeeds on
>                 non-readable subvolume
>                                            b. fails on readable subvolume
>
>                                           What is the result reported to
>                 higher layers
>                                         in the above scenario? If
>                                           mkdir is failed, is it cleaned
>                 up on
>                                         non-readable subvolume where it
>                                           failed?
>
>                                         I am interested in this case as
>                 dht-preop check
>                                         relies on layout xattrs
>
>                                     and I
>
>                                         assume layout xattrs in
>                 particular (and all
>                                         xattrs in general) are
>                                         guaranteed to be correct only on
>                 a readable
>                                         subvolume of afr. So, in
>
>                                     essence
>
>                                         we shouldn't be winding down
>                 mkdir on
>                                         non-readable subvols as whatever
>
>                                     the
>
>                                         decision brick makes as part of
>                 pre-op check is
>                                         inherently flawed.
>
>                                         regards,
>                                         Raghavendra
>
>                                 --
>                                 Pranith
>
>                         _______________________________________________
>                         Gluster-devel mailing list
>                         Gluster-devel at gluster.org
>                 <mailto:Gluster-devel at gluster.org>
>                 <mailto:Gluster-devel at gluster.org
>                 <mailto:Gluster-devel at gluster.org>>
>
>                 http://www.gluster.org/mailman/listinfo/gluster-devel
>
>
>
>
>                     --
>                     Raghavendra G
>
>
>
>
>
>                 --
>                 Pranith
>
>
>
>
>
>
> --
> Raghavendra G