[Gluster-devel] Query regards to heal xattr heal in dht

Raghavendra Gowdappa rgowdapp at redhat.com
Thu Sep 15 11:51:07 UTC 2016



----- Original Message -----
> From: "Xavier Hernandez" <xhernandez at datalab.es>
> To: "Raghavendra G" <raghavendra at gluster.com>, "Nithya Balachandran" <nbalacha at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Mohit Agrawal" <moagrawa at redhat.com>
> Sent: Thursday, September 15, 2016 4:54:25 PM
> Subject: Re: [Gluster-devel] Query regards to heal xattr heal in dht
> 
> 
> 
> On 15/09/16 11:31, Raghavendra G wrote:
> >
> >
> > On Thu, Sep 15, 2016 at 12:02 PM, Nithya Balachandran
> > <nbalacha at redhat.com> wrote:
> >
> >
> >
> >     On 8 September 2016 at 12:02, Mohit Agrawal
> >     <moagrawa at redhat.com> wrote:
> >
> >         Hi All,
> >
> >            I have another solution to heal user xattrs, but before
> >         implementing it I would like to discuss it with you.
> >
> >            Can I call a function (dht_dir_xattr_heal, which internally
> >         calls syncop_setxattr) to heal xattrs at the end of
> >         dht_getxattr_cbk, once we are sure we have a valid xattr set?
> >            Inside dht_dir_xattr_heal we can either blindly copy all
> >         user xattrs to every subvolume, or compare each subvolume's
> >         xattrs against the valid set and call syncop_setxattr only when
> >         there is a mismatch; otherwise there is no need to call
> >         syncop_setxattr.
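> >
> >            A rough illustration of the compare-before-set idea
> >         (simplified C with made-up types instead of dict_t, and a plain
> >         assignment standing in for syncop_setxattr, just to show the
> >         shape):
> >
> >         #include <string.h>
> >
> >         #define MAX_XATTRS 8
> >
> >         /* Illustrative model of a subvolume's user xattrs. */
> >         struct xattr_set {
> >             int  n;
> >             char keys[MAX_XATTRS][64];
> >             char vals[MAX_XATTRS][64];
> >         };
> >
> >         /* Assumes both sets keep entries in the same canonical order. */
> >         static int
> >         xattrs_equal(const struct xattr_set *a, const struct xattr_set *b)
> >         {
> >             if (a->n != b->n)
> >                 return 0;
> >             for (int i = 0; i < a->n; i++)
> >                 if (strcmp(a->keys[i], b->keys[i]) != 0 ||
> >                     strcmp(a->vals[i], b->vals[i]) != 0)
> >                     return 0;
> >             return 1;
> >         }
> >
> >         /* Heal only the subvolumes whose xattrs differ from the valid
> >          * set; the assignment stands in for syncop_setxattr. */
> >         static int
> >         heal_if_needed(struct xattr_set *subvols, int nsubvols,
> >                        const struct xattr_set *valid)
> >         {
> >             int healed = 0;
> >             for (int i = 0; i < nsubvols; i++)
> >                 if (!xattrs_equal(&subvols[i], valid)) {
> >                     subvols[i] = *valid;
> >                     healed++;
> >                 }
> >             return healed;
> >         }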
> >
> >
> >
> >     This can be problematic if a particular xattr is being removed - it
> >     might still exist on some subvols. IIUC, the heal would go and reset
> >     it again?
> >
> >     One option is to use the hashed subvol for the dir as the source -
> >     perform the xattr op on the hashed subvol first, and on the others
> >     only if it succeeds on the hashed one. This does have the problem
> >     of being unable to set xattrs if the hashed subvol is unavailable.
> >     That might not be such a big deal for distributed-replicate or
> >     distributed-disperse volumes, but it will affect pure distribute.
> >     However, this way we can at least be reasonably certain of
> >     correctness (leaving rebalance out of the picture).
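> >
> >     Roughly (an illustrative sketch only, not actual dht code; the
> >     apply callback here is a stand-in for winding the setxattr fop to
> >     one subvol):
> >
> >     /* Hashed-subvol-as-source idea: apply the op on the hashed
> >      * subvol first, and fan out to the rest only on success. */
> >     static int
> >     setxattr_hashed_first(int hashed, int nsubvols,
> >                           int (*apply)(int subvol))
> >     {
> >         if (apply(hashed) != 0)
> >             return -1;          /* hashed subvol down: fail the op */
> >         for (int i = 0; i < nsubvols; i++)
> >             if (i != hashed)
> >                 apply(i);       /* best effort on the others */
> >         return 0;
> >     }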
> >
> >
> > * What is the behavior of getxattr when the hashed subvol is down? Should
> > we succeed with values from the non-hashed subvols, or should we fail
> > getxattr? With the hashed-subvol as the source of truth, it's difficult to
> > determine the correctness of xattrs and their values when it is down.
> >
> > * setxattr is an inode operation (as opposed to an entry operation). So we
> > cannot calculate the hashed-subvol, because in (get)(set)xattr the parent
> > layout and "basename" are not available. This forces us to store the
> > hashed subvol in inode-ctx, and when the hashed-subvol changes we need to
> > update these inode-ctxs too.
> >
> > What do you think about a Quorum based solution to this problem?
> >
> > 1. setxattr succeeds only if it is successful on at least (n/2 + 1)
> > number of subvols.
> > 2. getxattr succeeds only if it is successful and values match on at
> > least (n/2 + 1) number of subvols.
> >
> > The flip-side of this solution is that we increase the probability of
> > failure of (get)(set)xattr operations compared to the
> > hashed-subvol-as-source-of-truth solution. Or do we - how do we compare
> > the probability of the hashed-subvol going down with the probability of
> > (n/2 + 1) nodes going down simultaneously? Is it 1/n vs
> > (1/n * 1/n * ... (n/2 + 1 times))? And is 1/n the correct probability for
> > _a specific subvol (the hashed-subvol)_ going down (as opposed to _any one
> > subvol_ going down)?
> 
> If we suppose p to be the probability of failure of a subvolume in a
> period of time (a year for example), all subvolumes have the same
> probability, and we have N subvolumes, then:
> 
> Probability of failure of hashed-subvol: p
> Probability of failure of N/2 + 1 or more subvols: <attached as an image>

Thanks Xavi. That was quick :).
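
Just to spell the attached expression out in text (assuming independent
subvolume failures, each with probability p), I believe it is the binomial
tail:

    P(at least N/2 + 1 subvols fail)
        = \sum_{k = \lfloor N/2 \rfloor + 1}^{N} \binom{N}{k} p^k (1 - p)^{N - k}

which, as you note below, stays below p whenever p < 0.5.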

> 
> Note that this probability says how probable it is that N/2 + 1 subvols
> or more fail in the specified period of time, but not necessarily
> simultaneously. If we suppose that subvolumes are recovered as fast as
> possible, the real probability of simultaneous failure will be much
> smaller.
> 
> In the worst case (not recovering the failed subvolumes within the given
> period of time), if p < 0.5 or N = 2 (and p != 1), then it's always better
> to check N/2 + 1 subvolumes. Otherwise, it's better to check the
> hashed-subvol.
> 
> I think that p should always be much smaller than 0.5 for the small
> periods of time in which subvolume recovery cannot be completed before
> other failures, so checking half plus one subvols should always be the
> best option in terms of probability. Performance can suffer, though, if
> some kind of synchronization is needed.

For this problem, no synchronization is needed, though we do need to wind
the (get)(set)xattr call to all subvols. What I didn't think through is
rollback/rollforward during setxattr when the op fails on more than a
quorum of subvols. One problem with the rollback approach is that we may
never get a chance to roll back at all; another is how we handle racing
setxattrs on the same key from different clients/apps (coupled with
rollback etc.). I need to think more about this.
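
To make the quorum counting concrete, here is a minimal sketch (plain C
with a made-up reply struct, not the actual dht callback; it only shows how
quorum would be evaluated once every subvol has replied):

#include <string.h>

/* Illustrative only: one (get)xattr reply per subvolume. */
struct reply {
    int  op_ret;        /* 0 on success, -1 on failure            */
    char value[256];    /* xattr value returned by that subvolume */
};

/* getxattr succeeds only if at least n/2 + 1 subvolumes returned
 * success AND agree on the value; returns the index of a reply
 * holding the winning value, or -1 if there is no quorum. */
static int
getxattr_quorum(const struct reply *r, int n)
{
    int quorum = n / 2 + 1;

    for (int i = 0; i < n; i++) {
        if (r[i].op_ret != 0)
            continue;
        int matches = 0;
        for (int j = 0; j < n; j++)
            if (r[j].op_ret == 0 && strcmp(r[i].value, r[j].value) == 0)
                matches++;
        if (matches >= quorum)
            return i;
    }
    return -1;
}

/* setxattr succeeds only if at least n/2 + 1 subvolumes succeeded. */
static int
setxattr_quorum(const struct reply *r, int n)
{
    int ok = 0;

    for (int i = 0; i < n; i++)
        if (r[i].op_ret == 0)
            ok++;
    return ok >= n / 2 + 1;
}

This does not address the rollback question above for a setxattr that lands
on fewer subvols than the quorum.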

> 
> Xavi
> >
> >
> >
> >
> >
> >
> >            Let me know if this approach is suitable.
> >
> >
> >
> >         Regards
> >         Mohit Agrawal
> >
> >         On Wed, Sep 7, 2016 at 10:27 PM, Pranith Kumar Karampuri
> >         <pkarampu at redhat.com> wrote:
> >
> >
> >
> >             On Wed, Sep 7, 2016 at 9:46 PM, Mohit Agrawal
> >             <moagrawa at redhat.com> wrote:
> >
> >                 Hi Pranith,
> >
> >
> >                 In the current approach I am getting the list of
> >                 xattrs from the first up subvolume and updating the
> >                 user attributes from that xattr set onto all the other
> >                 subvolumes.
> >
> >                 I have assumed the first up subvol is the source and
> >                 the rest of them are sinks, as we do the same in
> >                 dht_dir_attr_heal.
> >
> >
> >             As per my understanding, the first up subvol can be
> >             different for different mounts; I could be wrong.
> >
> >
> >
> >                 Regards
> >                 Mohit Agrawal
> >
> >                 On Wed, Sep 7, 2016 at 9:34 PM, Pranith Kumar Karampuri
> >                 <pkarampu at redhat.com> wrote:
> >
> >                     hi Mohit,
> >                            How does dht find which subvolume has the
> >                     correct list of xattrs? i.e. how does it determine
> >                     which subvolume is source and which is sink?
> >
> >                     On Wed, Sep 7, 2016 at 2:35 PM, Mohit Agrawal
> >                     <moagrawa at redhat.com> wrote:
> >
> >                         Hi,
> >
> >                           I am trying to find a solution to one
> >                         problem in dht, specific to user xattr healing.
> >                           I tried to correct it the same way we heal
> >                         directory attributes, but I feel that is not
> >                         the best solution.
> >
> >                           To find the right way to heal xattrs I want
> >                         to discuss it here, in case anyone has a better
> >                         solution.
> >
> >                           Problem:
> >                            In a distributed volume, the custom extended
> >                         attribute value of a directory does not show
> >                         the correct value after a brick is stopped and
> >                         started. If an extended attribute is set on a
> >                         directory while a brick is stopped, that
> >                         attribute is not updated on the brick after it
> >                         is started again.
> >
> >                           Current approach:
> >                             1) Function set_user_xattr stores the user
> >                         extended attributes in a dictionary
> >                             2) Function dht_dir_xattr_heal calls
> >                         syncop_setxattr to update the attributes on
> >                         all subvolumes
> >                             3) dht_dir_xattr_heal is called for every
> >                         directory lookup in dht_lookup_revalidate_cbk
> >
> >                           Pseudocode for function dht_dir_xattr_heal is
> >                         as below:
> >
> >                            1) First it fetches the attributes from the
> >                         first up subvolume and stores them into xattr.
> >                            2) Loop over all subvolumes and fetch the
> >                         existing attributes from each one.
> >                            3) Replace the user attributes in the
> >                         current attributes with the user attributes
> >                         from xattr (sketched after this list).
> >                            4) Set the resulting extended attributes
> >                         (current attributes + user attributes from
> >                         step 1) onto the subvol.
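> >
> >                           A rough sketch of step 3 (illustrative C
> >                         only, with simplified key/value arrays instead
> >                         of dict_t; the real code would then call
> >                         syncop_setxattr with the merged set):
> >
> >                         #include <string.h>
> >
> >                         #define MAX_XATTRS 16
> >
> >                         /* Illustrative model of one xattr set. */
> >                         struct xset {
> >                             int  n;
> >                             char key[MAX_XATTRS][64];
> >                             char val[MAX_XATTRS][64];
> >                         };
> >
> >                         /* Overlay every "user." key from the source
> >                          * (the first up subvol) onto a subvolume's
> >                          * existing set, leaving its other keys alone.
> >                          * Assumes dst has room for any new keys. */
> >                         static void
> >                         overlay_user_xattrs(const struct xset *src,
> >                                             struct xset *dst)
> >                         {
> >                             for (int i = 0; i < src->n; i++) {
> >                                 if (strncmp(src->key[i], "user.", 5))
> >                                     continue;
> >                                 int j;
> >                                 for (j = 0; j < dst->n; j++)
> >                                     if (!strcmp(dst->key[j], src->key[i]))
> >                                         break;   /* overwrite value */
> >                                 if (j == dst->n)
> >                                     dst->n++;    /* append new key  */
> >                                 strcpy(dst->key[j], src->key[i]);
> >                                 strcpy(dst->val[j], src->val[i]);
> >                             }
> >                         }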
> >
> >
> >                            The problems with this current approach are:
> >
> >                            1) It calls the heal function
> >                         (dht_dir_xattr_heal) for every directory
> >                         lookup, without comparing xattrs first.
> >                             2) The function internally calls
> >                         syncop_setxattr for every subvolume, which is
> >                         an expensive operation.
> >
> >                            I have another way, below, to correct it,
> >                         but it depends on time (and I am not sure
> >                         whether time is in sync on all bricks or not):
> >
> >                            1) At the time of setting an extended
> >                         attribute (setxattr), record the change time
> >                         in the metadata on the server side.
> >                            2) Compare the change times before calling
> >                         the healing function in dht_revalidate_cbk
> >                         (a rough sketch follows).
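> >
> >                           For example, the comparison in step 2 could
> >                         look roughly like this (illustrative only;
> >                         xattr_ctime[] is a made-up array holding the
> >                         change time reported by each subvolume):
> >
> >                         /* Heal only when some subvolume's recorded
> >                          * xattr change time lags the newest one seen.
> >                          * Relies on brick clocks being in sync, which
> >                          * is exactly the concern above. */
> >                         static int
> >                         needs_heal(const long long *xattr_ctime,
> >                                    int nsubvols)
> >                         {
> >                             long long newest = xattr_ctime[0];
> >
> >                             for (int i = 1; i < nsubvols; i++)
> >                                 if (xattr_ctime[i] > newest)
> >                                     newest = xattr_ctime[i];
> >                             for (int i = 0; i < nsubvols; i++)
> >                                 if (xattr_ctime[i] < newest)
> >                                     return 1;
> >                             return 0;
> >                         }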
> >
> >                             Please share your input on this; it would
> >                         be much appreciated.
> >
> >                         Regards
> >                         Mohit Agrawal
> >
> >
> >
> >
> >
> >                     --
> >                     Pranith
> >
> >
> >
> >
> >
> >             --
> >             Pranith
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > --
> > Raghavendra G
> >
> >
> >
> 

