[Gluster-devel] Query regards to heal xattr heal in dht

Mohit Agrawal moagrawa at redhat.com
Fri Sep 16 12:54:47 UTC 2016


Hi,

I think we should divide the problem into two parts:

  1) The user extended attribute is not shown correctly by getxattr on the
mount point.
  2) Healing the user xattrs on bricks that were down at the time setxattr
was run.


To show the correct extended attribute on the mount point, I think the
quorum approach is good. It should be sufficient to consider nodes as
sources if more than half of them have the same user xattr value.

How can we find the correct value?

1) For every subvolume, calculate a hash over its user xattr key/value pairs
and store the hash value in a dict, keyed by the subvolume instance.
2) Find the subvolumes in the dict that share the same hash value and return
that xattr to the application.

The subvolumes that share the same hash value can be considered sources and
the rest sinks; a rough sketch of this selection is included below. With the
latest patch (http://review.gluster.org/#/c/15468/) I am not deleting any
xattr: the heal replaces an existing user xattr on a subvolume if it is
already present, otherwise it creates the xattr.
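
To make the selection concrete, here is a minimal self-contained sketch of
the hash-and-quorum idea (plain C, illustrative only; the fnv1a hash, the
subvol_t struct and the flattened "key=value" xattr strings are stand-ins
for the real dict-based code in the patch, not gluster APIs):

#include <stdint.h>
#include <stdio.h>

#define MAX_SUBVOLS 16

typedef struct {
    const char *name;
    const char *user_xattrs; /* user xattrs flattened as "key=value;..." (sorted) */
} subvol_t;

/* Illustrative 64-bit FNV-1a hash of the flattened user xattrs. */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Return the index of a source subvol if more than n/2 subvols share its
 * hash, otherwise -1 (no quorum: fail the getxattr rather than guess). */
static int pick_source(const subvol_t *subvols, int n, uint64_t *hashes)
{
    int i, j, count;

    for (i = 0; i < n; i++)
        hashes[i] = fnv1a(subvols[i].user_xattrs);

    for (i = 0; i < n; i++) {
        count = 0;
        for (j = 0; j < n; j++)
            if (hashes[j] == hashes[i])
                count++;
        if (count > n / 2)
            return i; /* quorum reached: i is a source, the rest are sinks */
    }
    return -1;
}

int main(void)
{
    subvol_t subvols[] = {
        { "subvol-0", "user.foo=bar" },
        { "subvol-1", "user.foo=bar" },
        { "subvol-2", "user.foo=old" }, /* stale copy: brick was down */
    };
    uint64_t hashes[MAX_SUBVOLS];
    int src = pick_source(subvols, 3, hashes);

    printf("source subvol: %d\n", src); /* prints 0 */
    return 0;
}

Any subvolume whose hash differs from the chosen source would then be healed
via syncop_setxattr, as the patch does.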

Regarding ACLs/SELinux: because I am updating only user xattrs, those
attributes will remain unchanged after the heal function has run.

Regards
Mohit Agrawal

On Fri, Sep 16, 2016 at 9:42 AM, Nithya Balachandran <nbalacha at redhat.com>
wrote:

>
>
> On 15 September 2016 at 17:21, Raghavendra Gowdappa <rgowdapp at redhat.com>
> wrote:
>
>>
>>
>> ----- Original Message -----
>> > From: "Xavier Hernandez" <xhernandez at datalab.es>
>> > To: "Raghavendra G" <raghavendra at gluster.com>, "Nithya Balachandran" <
>> nbalacha at redhat.com>
>> > Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Mohit Agrawal" <
>> moagrawa at redhat.com>
>> > Sent: Thursday, September 15, 2016 4:54:25 PM
>> > Subject: Re: [Gluster-devel] Query regards to heal xattr heal in dht
>> >
>> >
>> >
>> > On 15/09/16 11:31, Raghavendra G wrote:
>> > >
>> > >
>> > > On Thu, Sep 15, 2016 at 12:02 PM, Nithya Balachandran
>> > > <nbalacha at redhat.com <mailto:nbalacha at redhat.com>> wrote:
>> > >
>> > >
>> > >
>> > >     On 8 September 2016 at 12:02, Mohit Agrawal <moagrawa at redhat.com
>> > >     <mailto:moagrawa at redhat.com>> wrote:
>> > >
>> > >         Hi All,
>> > >
>> > >            I have one another solution to heal user xattr but before
>> > >         implement it i would like to discuss with you.
>> > >
>> > >            Can i call function (dht_dir_xattr_heal internally it is
>> > >         calling syncop_setxattr) to heal xattr in dht_getxattr_cbk in
>> last
>> > >            after make sure we have a valid xattr.
>> > >            In function(dht_dir_xattr_heal) it will copy blindly all
>> user
>> > >         xattr on all subvolume or i can compare subvol xattr with
>> valid
>> > >         xattr if there is any mismatch then i will call
>> syncop_setxattr
>> > >         otherwise no need to call. syncop_setxattr.
>> > >
>> > >
>> > >
>> > >     This can be problematic if a particular xattr is being removed -
>> it
>> > >     might still exist on some subvols. IIUC, the heal would go and
>> reset
>> > >     it again?
>> > >
>> > >     One option is to use the hash subvol for the dir as the source -
>> so
>> > >     perform xattr op on hashed subvol first and on the others only if
>> it
>> > >     succeeds on the hashed. This does have the problem of being unable
>> > >     to set xattrs if the hashed subvol is unavailable. This might not
>> be
>> > >     such a big deal in case of distributed replicate or distribute
>> > >     disperse volumes but will affect pure distribute. However, this
>> way
>> > >     we can at least be reasonably certain of the correctness (leaving
>> > >     rebalance out of the picture).
>> > >
>> > >
>> > > * What is the behavior of getxattr when hashed subvol is down? Should
>> we
>> > > succeed with values from non-hashed subvols or should we fail
>> getxattr?
>> > > With hashed-subvol as source of truth, its difficult to determine
>> > > correctness of xattrs and their values when it is down.
>> > >
>> > > * setxattr is an inode operation (as opposed to entry operation). So,
>> we
>> > > cannot calculate hashed-subvol as in (get)(set)xattr, parent layout
>> and
>> > > "basename" is not available. This forces us to store hashed subvol in
>> > > inode-ctx. Now, when the hashed-subvol changes we need to update these
>> > > inode-ctxs too.
>> > >
>> > > What do you think about a Quorum based solution to this problem?
>> > >
>> > > 1. setxattr succeeds only if it is successful on at least (n/2 + 1)
>> > > number of subvols.
>> > > 2. getxattr succeeds only if it is successful and values match on at
>> > > least (n/2 + 1) number of subvols.
>> > >
>> > > The flip-side of this solution is we are increasing the probability of
>> > > failure of (get)(set)xattr operations as opposed to the hashed-subvol
>> as
>> > > source of truth solution. Or are we - how do we compare probability of
>> > > hashed-subvol going down with probability of (n/2 + 1) nodes going
>> down
>> > > simultaneously? Is it 1/n vs (1/n*1/n*... (n/2+1 times)?. Is 1/n
>> correct
>> > > probability for _a specific subvol (hashed-subvol)_ going down (as
>> > > opposed to _any one subvol_ going down)?
>> >
>> > If we suppose p to be the probability of failure of a subvolume in a
>> > period of time (a year for example), all subvolumes have the same
>> > probability, and we have N subvolumes, then:
>> >
>> > Probability of failure of hashed-subvol: p
>> > Probability of failure of N/2 + 1 or more subvols: <attached as an image>
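>> >
>> > [The attached image is not in this archive; assuming independent
>> > subvolumes that each fail with probability p in the period, the quantity
>> > is presumably the binomial tail
>> >
>> >   P\left(\text{at least } \lfloor N/2 \rfloor + 1 \text{ of } N \text{ fail}\right)
>> >     = \sum_{k=\lfloor N/2 \rfloor + 1}^{N} \binom{N}{k}\, p^{k} (1-p)^{N-k},
>> >
>> > e.g. about 0.0037 for N = 4 and p = 0.1, versus 0.1 for the single
>> > hashed subvolume.]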
>>
>> Thanks Xavi. That was quick :).
>>
>> >
>> > Note that this probability says how much probable is that N/2 + 1
>> > subvols or more fail in the specified period of time, but not
>> > necessarily simultaneously. If we suppose that subvolumes are recovered
>> > as fast as possible, the real probability of simultaneous failure will
>> > be much smaller.
>> >
>> > In worst case (not recovering the failed subvolumes in the given period
>> > of time), if p < 0.5 or N = 2 (and p != 1), then it's always better to
>> > check N/2 + 1 subvolumes. Otherwise, it's better to check the
>> hashed-subvol.
>> >
>> > I think that p should always be much smaller than 0.5 for small periods
>> > of time where subvolume recovery could no be completed before other
>> > failures, so checking half plus one subvols should always be the best
>> > option in terms of probability. Performance can suffer though if some
>> > kind of synchronization is needed.
>>
>> For this problem, no synchronization is needed. We need to wind the
>> (get)(set)xattr call to all subvols though. What I didn't think through is
>> rollback/rollforward during setxattr if the op fails on more than quorum
>> subvols. One problem with rollback approach is that we may never get a
>> chance to rollback at all and how do we handle racing setxattrs on the same
>> key from different clients/apps (coupled with rollback etc). I need to
>> think more about this.
>>
>>
> A quorum will make it difficult to figure out the correct value in case of
> in flight modifications/deletions How do you decide the correct value ?
> This can potentially cause in progress modifications /deletions to be
> overwritten and to fail silently. The single point of truth (hashed subvol)
> helps here. Else we will need to bring in some synchronization here.
>
> I have not looked at the patch yet but how does it currently handle xattr
> deletes that failed on a subvol?
>
> A hashed subvol being unavailable should hopefully not be very common esp
> if there is AFR or EC loaded. A pure distribute volume is expected to have
> some data unavailability if bricks are unavailable.
>
> The one case where this could cause major issues as Shyam pointed out is
> in the case of ACLs /SE linux where the information is stored in xattrs.
> Something to think about.
>
>
>> >
>> > Xavi
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >            Let me know if this approach is suitable.
>> > >
>> > >
>> > >
>> > >         Regards
>> > >         Mohit Agrawal
>> > >
>> > >         On Wed, Sep 7, 2016 at 10:27 PM, Pranith Kumar Karampuri
>> > >         <pkarampu at redhat.com <mailto:pkarampu at redhat.com>> wrote:
>> > >
>> > >
>> > >
>> > >             On Wed, Sep 7, 2016 at 9:46 PM, Mohit Agrawal
>> > >             <moagrawa at redhat.com <mailto:moagrawa at redhat.com>> wrote:
>> > >
>> > >                 Hi Pranith,
>> > >
>> > >
>> > >                 In current approach i am getting list of xattr from
>> > >                 first up volume and update the user attributes from
>> that
>> > >                 xattr to
>> > >                 all other volumes.
>> > >
>> > >                 I have assumed first up subvol is source and rest of
>> > >                 them are sink as we are doing same in
>> dht_dir_attr_heal.
>> > >
>> > >
>> > >             I think first up subvol is different for different mounts
>> as
>> > >             per my understanding, I could be wrong.
>> > >
>> > >
>> > >
>> > >                 Regards
>> > >                 Mohit Agrawal
>> > >
>> > >                 On Wed, Sep 7, 2016 at 9:34 PM, Pranith Kumar
>> Karampuri
>> > >                 <pkarampu at redhat.com <mailto:pkarampu at redhat.com>>
>> wrote:
>> > >
>> > >                     hi Mohit,
>> > >                            How does dht find which subvolume has the
>> > >                     correct list of xattrs? i.e. how does it determine
>> > >                     which subvolume is source and which is sink?
>> > >
>> > >                     On Wed, Sep 7, 2016 at 2:35 PM, Mohit Agrawal
>> > >                     <moagrawa at redhat.com <mailto:moagrawa at redhat.com
>> >>
>> > >                     wrote:
>> > >
>> > >                         Hi,
>> > >
>> > >                           I am trying to find out solution of one
>> > >                         problem in dht specific to user xattr healing.
>> > >                           I tried to correct it in a same way as we
>> are
>> > >                         doing for healing dir attribute but i feel it
>> is
>> > >                         not best solution.
>> > >
>> > >                           To find a right way to heal xattr i want to
>> > >                         discuss with you if anyone does have better
>> > >                         solution to correct it.
>> > >
>> > >                           Problem:
>> > >                            In a distributed volume environment custom
>> > >                         extended attribute value for a directory does
>> > >                         not display correct value after stop/start the
>> > >                         brick. If any extended attribute value is set
>> > >                         for a directory after stop the brick the
>> > >                         attribute value is not updated on brick after
>> > >                         start the brick.
>> > >
>> > >                           Current approach:
>> > >                             1) function set_user_xattr to store user
>> > >                         extended attribute in dictionary
>> > >                             2) function dht_dir_xattr_heal call
>> > >                         syncop_setxattr to update the attribute on all
>> > >                         volume
>> > >                             3) Call the function (dht_dir_xattr_heal)
>> > >                         for every directory lookup in
>> > >                         dht_lookup_revalidate_cbk
>> > >
>> > >                           Psuedocode for function dht_dir_xatt_heal is
>> > >                         like below
>> > >
>> > >                            1) First it will fetch atttributes from
>> first
>> > >                         up volume and store into xattr.
>> > >                            2) Run loop on all subvolume and fetch
>> > >                         existing attributes from every volume
>> > >                            3) Replace user attributes from current
>> > >                         attributes with xattr user attributes
>> > >                            4) Set latest extended attributes(current +
>> > >                         old user attributes) inot subvol.
>> > >
>> > >
>> > >                            In this current approach problem is
>> > >
>> > >                            1) it will call heal
>> > >                         function(dht_dir_xattr_heal) for every
>> directory
>> > >                         lookup without comparing xattr.
>> > >                             2) The function internally call syncop
>> xattr
>> > >                         for every subvolume that would be a expensive
>> > >                         operation.
>> > >
>> > >                            I have one another way like below to
>> correct
>> > >                         it but again in this one it does have
>> dependency
>> > >                         on time (not sure time is synch on all bricks
>> or
>> > >                         not)
>> > >
>> > >                            1) At the time of set extended
>> > >                         attribute(setxattr) change time in metadata at
>> > >                         server side
>> > >                            2) Compare change time before call healing
>> > >                         function in dht_revalidate_cbk
>> > >
>> > >                             Please share your input on this.
>> > >                             Appreciate your input.
>> > >
>> > >                         Regards
>> > >                         Mohit Agrawal
>> > >
>> > >                         _______________________________________________
>> > >                         Gluster-devel mailing list
>> > >                         Gluster-devel at gluster.org
>> > >                         <mailto:Gluster-devel at gluster.org>
>> > >                         http://www.gluster.org/mailman/listinfo/gluster-devel
>> > >                         <http://www.gluster.org/mailman/listinfo/gluster-devel>
>> > >
>> > >
>> > >
>> > >
>> > >                     --
>> > >                     Pranith
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >             --
>> > >             Pranith
>> > >
>> > >
>> > >
>> > >         _______________________________________________
>> > >         Gluster-devel mailing list
>> > >         Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
>> > >         http://www.gluster.org/mailman/listinfo/gluster-devel
>> > >         <http://www.gluster.org/mailman/listinfo/gluster-devel>
>> > >
>> > >
>> > >
>> > >     _______________________________________________
>> > >     Gluster-devel mailing list
>> > >     Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
>> > >     http://www.gluster.org/mailman/listinfo/gluster-devel
>> > >     <http://www.gluster.org/mailman/listinfo/gluster-devel>
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Raghavendra G
>> > >
>> > >
>> > > _______________________________________________
>> > > Gluster-devel mailing list
>> > > Gluster-devel at gluster.org
>> > > http://www.gluster.org/mailman/listinfo/gluster-devel
>> > >
>> >
>>
>
>