[Gluster-devel] Query regards to heal xattr heal in dht
Raghavendra Gowdappa
rgowdapp at redhat.com
Thu Sep 15 11:51:07 UTC 2016
----- Original Message -----
> From: "Xavier Hernandez" <xhernandez at datalab.es>
> To: "Raghavendra G" <raghavendra at gluster.com>, "Nithya Balachandran" <nbalacha at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Mohit Agrawal" <moagrawa at redhat.com>
> Sent: Thursday, September 15, 2016 4:54:25 PM
> Subject: Re: [Gluster-devel] Query regards to heal xattr heal in dht
>
>
>
> On 15/09/16 11:31, Raghavendra G wrote:
> >
> >
> > On Thu, Sep 15, 2016 at 12:02 PM, Nithya Balachandran
> > <nbalacha at redhat.com <mailto:nbalacha at redhat.com>> wrote:
> >
> >
> >
> > On 8 September 2016 at 12:02, Mohit Agrawal <moagrawa at redhat.com
> > <mailto:moagrawa at redhat.com>> wrote:
> >
> > Hi All,
> >
> >         I have another solution to heal user xattrs, but before
> >         implementing it I would like to discuss it with you.
> >
> >         Can I call dht_dir_xattr_heal (which internally calls
> >         syncop_setxattr) at the end of dht_getxattr_cbk, once we are sure
> >         we have a valid xattr?
> >         Inside dht_dir_xattr_heal I can either blindly copy all user
> >         xattrs to every subvolume, or compare each subvolume's xattrs
> >         with the valid xattr and call syncop_setxattr only when there is
> >         a mismatch; otherwise there is no need to call syncop_setxattr.
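
A minimal sketch of the compare-before-heal variant (the fetch/match/write
helpers below are hypothetical wrappers around the syncop_*xattr calls, not
real dht functions):

    /* Illustrative only: heal a subvol's user xattrs only when they
     * differ from the "valid" dict assembled in dht_getxattr_cbk. */
    for (i = 0; i < conf->subvolume_cnt; i++) {
            dict_t *cur = fetch_user_xattrs (conf->subvolumes[i], loc);

            if (cur && !user_xattrs_match (cur, valid))
                    write_user_xattrs (conf->subvolumes[i], loc, valid);

            if (cur)
                    dict_unref (cur);
    }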
> >
> >
> >
> > This can be problematic if a particular xattr is being removed - it
> > might still exist on some subvols. IIUC, the heal would go and reset
> > it again?
> >
> >     One option is to use the hashed subvol for the dir as the source:
> >     perform the xattr op on the hashed subvol first, and on the others
> >     only if it succeeds on the hashed one. This does have the problem of
> >     being unable to set xattrs if the hashed subvol is unavailable. That
> >     might not be such a big deal for distributed-replicate or
> >     distributed-disperse volumes, but it will affect pure distribute.
> >     However, this way we can at least be reasonably certain of correctness
> >     (leaving rebalance out of the picture).
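
In rough pseudo-C, that ordering would be something like the following (the
do_setxattr wrapper is hypothetical, not the dht implementation):

    /* Illustrative only: hashed subvol acts as the source of truth. */
    ret = do_setxattr (hashed_subvol, loc, xattr);   /* source first */
    if (ret != 0)
            return ret;   /* fail without touching the other subvols */

    for (i = 0; i < conf->subvolume_cnt; i++) {
            if (conf->subvolumes[i] == hashed_subvol)
                    continue;
            (void) do_setxattr (conf->subvolumes[i], loc, xattr);
    }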
> >
> >
> > * What is the behavior of getxattr when the hashed subvol is down? Should
> > we succeed with values from the non-hashed subvols, or should we fail
> > getxattr? With the hashed subvol as the source of truth, it's difficult to
> > determine the correctness of xattrs and their values when it is down.
> >
> > * setxattr is an inode operation (as opposed to an entry operation), so we
> > cannot calculate the hashed subvol inside (get)(set)xattr: the parent
> > layout and "basename" are not available. This forces us to store the
> > hashed subvol in the inode-ctx. Then, whenever the hashed subvol changes,
> > we need to update these inode-ctxs too.
> >
> > What do you think about a quorum-based solution to this problem?
> >
> > 1. setxattr succeeds only if it is successful on at least (n/2 + 1)
> > subvols.
> > 2. getxattr succeeds only if it is successful and the values match on at
> > least (n/2 + 1) subvols.
> >
> > The flip side of this solution is that we increase the probability of
> > failure of (get)(set)xattr operations compared to the solution that uses
> > the hashed subvol as the source of truth. Or do we? How do we compare the
> > probability of the hashed subvol going down with the probability of
> > (n/2 + 1) subvols going down simultaneously? Is it 1/n vs
> > (1/n * 1/n * ... , n/2+1 times)? And is 1/n the correct probability for
> > _a specific subvol (the hashed subvol)_ going down (as opposed to _any one
> > subvol_ going down)?
>
> If we suppose p to be the probability of failure of a subvolume in a
> period of time (a year for example), all subvolumes have the same
> probability, and we have N subvolumes, then:
>
> Probability of failure of hashed-subvol: p
> Probability of failure of N/2 + 1 or more subvols: <attached as an image>
Thanks Xavi. That was quick :).
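
Writing out what I assume the attached expression to be (independent
subvolume failures, each with probability p, over the same period):

    P(N/2 + 1 or more subvols fail) =
        sum_{k = floor(N/2) + 1}^{N}  C(N, k) * p^k * (1 - p)^(N - k)

i.e. the binomial tail for at least N/2 + 1 of the N subvols failing.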
>
> Note that this probability says how probable it is that N/2 + 1 or more
> subvols fail in the specified period of time, but not necessarily
> simultaneously. If we suppose that subvolumes are recovered as fast as
> possible, the real probability of simultaneous failure will be much
> smaller.
>
> In the worst case (failed subvolumes are not recovered within the given
> period of time), if p < 0.5 or N = 2 (and p != 1), it's always better to
> check N/2 + 1 subvolumes. Otherwise, it's better to check the hashed subvol.
>
> I think that p should always be much smaller than 0.5 for the short periods
> of time in which subvolume recovery cannot be completed before other
> failures occur, so checking half plus one subvols should always be the best
> option in terms of probability. Performance can suffer, though, if some
> kind of synchronization is needed.
For this problem, no synchronization is needed; we do need to wind the
(get)(set)xattr call to all subvols, though. What I didn't think through is
rollback/roll-forward during setxattr when the op fails to meet quorum. One
problem with the rollback approach is that we may never get a chance to roll
back at all, and we also have to handle racing setxattrs on the same key from
different clients/apps (coupled with rollback etc.). I need to think more
about this.
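
To make the getxattr side of the quorum check concrete, here is a minimal
standalone sketch (plain C with made-up reply types, not the actual dht
frame/local structures):

    #include <string.h>

    /* Illustrative only: one getxattr reply per subvol. */
    struct reply {
            int         op_ret;   /* 0 on success, -1 on failure       */
            const char *value;    /* value of the queried key, or NULL */
    };

    /* Return 0 and set *out if at least n/2 + 1 subvols succeeded with
     * identical values; return -1 otherwise. */
    static int
    quorum_getxattr (struct reply *replies, int n, const char **out)
    {
            int quorum = n / 2 + 1;
            int i, j;

            for (i = 0; i < n; i++) {
                    int matches = 0;

                    if (replies[i].op_ret != 0 || !replies[i].value)
                            continue;

                    for (j = 0; j < n; j++)
                            if (replies[j].op_ret == 0 && replies[j].value &&
                                strcmp (replies[i].value, replies[j].value) == 0)
                                    matches++;

                    if (matches >= quorum) {
                            *out = replies[i].value;
                            return 0;
                    }
            }

            return -1;   /* no value reached quorum */
    }

With this policy the availability of any single subvol (hashed or not) does
not matter, as long as a majority of subvols are up and agree on the value.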
>
> Xavi
> >
> >
> >
> >
> >
> >
> > Let me know if this approach is suitable.
> >
> >
> >
> > Regards
> > Mohit Agrawal
> >
> > On Wed, Sep 7, 2016 at 10:27 PM, Pranith Kumar Karampuri
> > <pkarampu at redhat.com <mailto:pkarampu at redhat.com>> wrote:
> >
> >
> >
> > On Wed, Sep 7, 2016 at 9:46 PM, Mohit Agrawal
> > <moagrawa at redhat.com <mailto:moagrawa at redhat.com>> wrote:
> >
> > Hi Pranith,
> >
> >
> >                 In the current approach I get the list of xattrs from the
> >                 first up subvolume and update the user attributes from
> >                 that xattr dict on all the other subvolumes.
> >
> >                 I have assumed the first up subvol is the source and the
> >                 rest are sinks, as we do the same in dht_dir_attr_heal.
> >
> >
> >             I think the first up subvol can be different for different
> >             mounts, as per my understanding; I could be wrong.
> >
> >
> >
> > Regards
> > Mohit Agrawal
> >
> > On Wed, Sep 7, 2016 at 9:34 PM, Pranith Kumar Karampuri
> > <pkarampu at redhat.com <mailto:pkarampu at redhat.com>> wrote:
> >
> > hi Mohit,
> > How does dht find which subvolume has the
> > correct list of xattrs? i.e. how does it determine
> > which subvolume is source and which is sink?
> >
> > On Wed, Sep 7, 2016 at 2:35 PM, Mohit Agrawal
> > <moagrawa at redhat.com <mailto:moagrawa at redhat.com>>
> > wrote:
> >
> > Hi,
> >
> >                             I am trying to find a solution to a problem
> >                             in dht specific to user xattr healing.
> >                             I tried to correct it the same way we heal
> >                             directory attributes, but I feel that is not
> >                             the best solution.
> >
> >                             To find the right way to heal xattrs I want
> >                             to discuss it with you, in case anyone has a
> >                             better solution.
> >
> >                             Problem:
> >                             In a distributed volume, a custom extended
> >                             attribute set on a directory does not show
> >                             the correct value after a brick is stopped
> >                             and started. If an extended attribute is set
> >                             on a directory while a brick is down, the
> >                             attribute is not updated on that brick after
> >                             it comes back up.
> >
> >                             Current approach:
> >                             1) A function set_user_xattr stores the user
> >                             extended attributes in a dictionary.
> >                             2) The function dht_dir_xattr_heal calls
> >                             syncop_setxattr to update the attributes on
> >                             all subvolumes.
> >                             3) dht_dir_xattr_heal is called for every
> >                             directory lookup in dht_lookup_revalidate_cbk.
> >
> >                             Pseudocode for dht_dir_xattr_heal is as
> >                             follows:
> >
> >                             1) Fetch the attributes from the first up
> >                             subvolume and store them in xattr.
> >                             2) Loop over all subvolumes and fetch the
> >                             existing attributes from each one.
> >                             3) Replace the user attributes in the current
> >                             attributes with the user attributes from
> >                             xattr.
> >                             4) Set the resulting extended attributes
> >                             (current attributes + the source's user
> >                             attributes) on that subvolume.
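
In rough C terms those steps amount to the loop below (the get/replace/set
helpers are hypothetical wrappers around the real dict and syncop calls):

    /* Illustrative only. */
    dict_t *src = get_user_xattrs (first_up_subvol, loc);        /* step 1 */

    for (i = 0; i < conf->subvolume_cnt; i++) {                  /* step 2 */
            dict_t *cur = get_all_xattrs (conf->subvolumes[i], loc);

            replace_user_keys (cur, src);                        /* step 3 */
            set_all_xattrs (conf->subvolumes[i], loc, cur);      /* step 4 */

            dict_unref (cur);
    }
    dict_unref (src);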
> >
> >
> >                             The problems with this approach are:
> >
> >                             1) It calls the heal function
> >                             (dht_dir_xattr_heal) for every directory
> >                             lookup, without comparing xattrs first.
> >                             2) The function internally makes syncop xattr
> >                             calls on every subvolume, which is an
> >                             expensive operation.
> >
> >                             I have another way to correct it, shown
> >                             below, but it depends on time (and I am not
> >                             sure whether time is in sync on all bricks):
> >
> >                             1) At the time of setting an extended
> >                             attribute (setxattr), update a change time in
> >                             the metadata on the server side.
> >                             2) Compare the change times before calling
> >                             the healing function in dht_revalidate_cbk.
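
As a rough sketch of step 2 (the accessor names are hypothetical, assuming
such a change time were returned with the per-subvol lookup replies):

    /* Illustrative only: heal only when the change times disagree. */
    ref = xattr_change_time (&replies[0]);
    for (i = 1; i < reply_cnt; i++) {
            if (xattr_change_time (&replies[i]) != ref) {
                    heal_needed = _gf_true;   /* then call dht_dir_xattr_heal */
                    break;
            }
    }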
> >
> >                             Please share your input on this; it would be
> >                             much appreciated.
> >
> > Regards
> > Mohit Agrawal
> >
> >
> >
> >
> >
> > --
> > Pranith
> >
> >
> >
> >
> >
> > --
> > Pranith
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > --
> > Raghavendra G
> >
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> >
>