[Gluster-devel] AFR-over-Unify (instead of the other way around) ?

Krishna Srinivas krishna at zresearch.com
Tue Jun 12 19:00:08 UTC 2007


Hi Cedric,

You are right, I will document AFR's behaviour (and internals, to some
extent) more extensively after our next release. AFR will also work
more intuitively in the next release (considering the things you mentioned).
We will also see how we can make AFR-over-unify work best.

Regards
Krishna

On 5/31/07, "Cédric Dufour @ IDIAP Research Institute"
<cedric.dufour at idiap.ch> wrote:
> Hello again,
>
> Thank you very much for your feedback.
>
> @ Krishna: your clarifications about AFR's behavior are very useful;
> wouldn't they fit well in your translators documentation (wiki)
> page? I actually struggled to understand why the AFR Cluster Example
> was configured that way (using both a 'brickN' and a 'brickN-afr'); it
> is now clear.
>
> @ Avati: that's what I was thinking of (and expecting); that'd be great!
> Now, while you are at it, would it be possible - from a
> write-operations point of view - to have AFR still treat a unified
> child as "binary": either unavailable (all nodes down) or
> available (at least one node up), provided the unifying scheduler
> allowed "failing over" from a failing node to the next potentially
> active node? As I understand it, this could be an extension in the
> spirit of the 'rr.limits.min-free-disk' option, e.g. 'rr.fail-over
> true'.
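>
> To illustrate, such a "failing-over" round-robin could be a mere option on
> the unify volume; a minimal sketch, where 'rr.fail-over' is only my
> suggested (hypothetical) name and the rest is the usual volume-spec syntax:
>
> volume unify1
>   type cluster/unify
>   subvolumes node1 node2 node3
>   option scheduler rr
>   option rr.limits.min-free-disk 5%  # existing option (value just an example)
>   option rr.fail-over true           # hypothetical: skip a dead node, try the next one
> end-volume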
>
> But then, it's easy for me to talk when I'm not the one doing the coding, isn't it!?! :-D
>
> Best regards
>
> Cédric Dufour @ IDIAP Research Institute
>
>
>
> Anand Avati wrote:
> > Currently AFR treats its children's failure as binary - 0 or 1: if one
> > fails 'completely', then it shifts over to the other. This behaviour is
> > being fixed, and the next release should work smoothly for you. Thanks
> > for your input!
> >
> > avati
> >
> > 2007/5/31, Krishna Srinivas <krishna at zresearch.com>:
> >> Hi Cedric,
> >>
> >> This setup (afr over unify) is possible, but afr and unify were not
> >> designed to work that way. In fact, we had thought about this kind of
> >> setup.
> >>
> >> AFR's job is to replicate files among its children. AFR expects
> >> its first child to have all the files (the replicate *:n option creates
> >> at least one copy, which would be on the first child). All the read-only
> >> operations (like readdir(), read(), getattr(), etc.) happen on AFR's
> >> first child. This being the case, if AFR's children are unify1 and
> >> unify2, and one of unify1's nodes goes down, then the user will not be
> >> able to access the files that were on the node that went down.
> >> If the whole of unify1 goes down, then AFR will fail over
> >> and do read operations from unify2.
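> >>
> >> To make this concrete, a rough sketch of the kind of spec we are talking
> >> about could look like the following. The volume names are only
> >> placeholders, and the exact options depend on the release (unify may
> >> also require a namespace option, omitted here for brevity):
> >>
> >> volume unify1
> >>   type cluster/unify
> >>   subvolumes brick1 brick2 brick3
> >>   option scheduler rr
> >> end-volume
> >>
> >> volume unify2
> >>   type cluster/unify
> >>   subvolumes brick4 brick5 brick6
> >>   option scheduler rr
> >> end-volume
> >>
> >> volume afr0
> >>   type cluster/afr
> >>   subvolumes unify1 unify2  # unify1 is the first child: reads go here
> >>   option replicate *:2      # at least one copy ends up on the first child
> >> end-volume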
> >>
> >> Regards,
> >> Krishna
> >>
> >> On 5/30/07, "Cédric Dufour @ IDIAP Research Institute"
> >> <cedric.dufour at idiap.ch> wrote:
> >> > Hello,
> >> >
> >> > First: Thanx!
> >> > After spending hours looking for a distributed fault-tolerant filesystem
> >> > that would allow me to set up LVS+HA without a mandatory (and dreaded)
> >> > "shared block-device", I eventually stumbled on GlusterFS, which just
> >> > sounds like the most promising filesystem with regard to performance,
> >> > fault-tolerance, no metadata-server requirement, ease of installation
> >> > (cf. no kernel patch/module), ease of configuration, and ease of
> >> > recovery (cf. easily accessible underlying filesystem)... Just great!!!
> >> >
> >> > Simulating a 12-node GlusterFS on the same server (cf. Julien Perez's
> >> > example), I have successfully unified (cluster/unify) 4 replicated
> >> > (cluster/afr) groups of 3 bricks each (cluster/unify OVER cluster/afr);
> >> > fault-tolerance works as expected (and I will warmly welcome the
> >> > AFR auto-healing feature from version 1.4 :-D )
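> >> >
> >> > For clarity, the spec follows roughly this shape (abridged to two of the
> >> > four AFR groups; the brick names stand for the underlying protocol/client
> >> > volumes and are placeholders):
> >> >
> >> > volume afr1
> >> >   type cluster/afr
> >> >   subvolumes brick1 brick2 brick3  # replicated group of 3 bricks
> >> > end-volume
> >> >
> >> > volume afr2
> >> >   type cluster/afr
> >> >   subvolumes brick4 brick5 brick6
> >> > end-volume
> >> >
> >> > # ... afr3 and afr4 are defined the same way ...
> >> >
> >> > volume unify0
> >> >   type cluster/unify
> >> >   subvolumes afr1 afr2 afr3 afr4   # unify OVER afr
> >> >   option scheduler rr
> >> > end-volume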
> >> >
> >> > Now, my question/problem/suggestion:
> >> >
> >> > Is it possible to work the aforementioned scenario the other way around,
> >> > in other words: cluster/afr OVER cluster/unify?
> >> > My test - using the branch 2.4 TLA check-out - *seems* to show the opposite:
> >> > - if a node is down, the RR-scheduler does not skip to the next
> >> > available node in the unified subvolume
> >> > - after bringing all redundant nodes down (for one given file) and back
> >> > up again, the system does not behave consistently: bringing just (the
> >> > "right") *one* node down again makes the file/partition unavailable
> >> > again (even though the two other nodes are up)
> >> >
> >> > Wouldn't AFR-over-Unify be useful?
> >> > The benefits would be:
> >> >  - a single node could be added to one of the AFR-ed "unified stripes",
> >> > thus avoiding the need to group AFR-ed nodes one-with-another as in
> >> > the Unify-over-AFR case (see the sketch after this list)
> >> >  - provided enough nodes AND a "failing-over" scheduler, AFR-healing
> >> > might not be needed, since AFR could "always" find a node to write to
> >> > in each "unified stripe" (all nodes within a "unified stripe" would
> >> > have to be down for it to completely fail)
> >> >
> >> > Now, maybe you'll tell me this scenario is inherently flawed :-/ (and I
> >> > just made a fool of myself :-D )
> >> >
> >> > PS: I had a thought about nesting a unified GlusterFS within another
> >> > AFR-ed GlusterFS... but I didn't do it, for I don't think it would be
> >> > very elegant (or performant?)... unless you tell me that's the way to
> >> > go :-)
> >> >
> >> > Thank you for your feedback,
> >> >
> >> > --
> >> >
> >> > Cédric Dufour @ IDIAP Research Institute
> >> >
> >> >
> >
> >
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>




