ian.latter at midnightcode.org
Thu May 10 23:52:43 UTC 2012
Actually, I want to clarify this point:
> But the problem today is that replicate (and
> self-heal) does not understand "partial failure"
> of its subvolumes. If one of the subvolume of
> replicate is a distribute, then today's replicate
> only understands complete failure of the
> distribute set or it assumes everything is
> completely fine.
I haven't seen this in practice .. I have seen replicate attempt to
repair anything that was "missing", and both the replicate and the
underlying bricks remained viable storage layers throughout that
process ...
----- Original Message -----
>From: "Ian Latter" <ian.latter at midnightcode.org>
>To: "Anand Avati" <anand.avati at gmail.com>
>Subject: Re: [Gluster-devel] ZkFarmer
>Date: Fri, 11 May 2012 09:39:58 +1000
> > > Sure, I have my own vol files that do (did) what I wanted
> > > and I was supporting myself (and users); the question
> > > (and the point) is what is the GlusterFS *intent*?
> > The "intent" (more or less - I hate to use the word as it
> can imply a
> > commitment to what I am about to say, but there isn't one)
> is to keep the
> > bricks (server process) dumb and have the intelligence on
> the client side.
> > This is a "rough goal". There are cases where replication
> on the server
> > side is inevitable (in the case of NFS access) but we keep
> the software
> > architecture undisturbed by running a client process on
> the server machine
> > to achieve it.
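(If I read that right, the arrangement is something like the following
volfile fragment - all names hypothetical - a cluster/replicate
translator loaded in a client graph that just happens to run on the
server machine, e.g. behind the NFS export:

volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host server1
  option remote-subvolume brick1
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host server2
  option remote-subvolume brick1
end-volume

volume repl
  type cluster/replicate
  subvolumes remote1 remote2
end-volume

i.e. the server process itself stays dumb; the replication logic lives
in the client graph, wherever that graph is physically loaded.)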
> [There's a difference between intent and plan/roadmap]
> Okay. Unfortunately I am unable to leverage this - I tried
> to serve a Fuse->GlusterFS client mount point (of a
> Distribute volume) as a GlusterFS posix brick (for a
> Replicate volume) and it wouldn't play ball ..
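For reference, what I attempted was roughly the following (names and
paths hypothetical): a storage/posix brick whose backing directory is
itself a FUSE mount of the distribute volume, re-exported so that a
replicate volume elsewhere could consume it as an ordinary brick:

volume wrapped
  type storage/posix
  # this directory is itself a FUSE mount of the distribute volume
  option directory /mnt/distribute-fuse
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.wrapped.allow *
  subvolumes wrapped
end-volume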
> > We do plan to support "replication on the server" in the
> future while still
> > retaining the existing software architecture as much as
> possible. This is
> > particularly useful in Hadoop environment where the jobs
> expect write
> > performance of a single copy and expect copy to happen in
> the background.
> > We have the proactive self-heal daemon running on the
> server machines now
> > (which again is a client process which happens to be
> physically placed on
> > the server) which gives us many interesting possibilities
> - i.e, with
> > simple changes where we fool the client side replicate
> translator at the
> > time of transaction initiation that only the closest
> server is up at that
> > point of time and write to it alone, and have the
> proactive self-heal
> > daemon perform the extra copies in the background. This
> would be consistent
> > with other readers as they get directed to the "right"
> version of the file
> > by inspecting the changelogs while the background
> replication is in
> > progress.
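(For anyone following along: the "changelogs" referred to here are the
pending-operation counters that replicate keeps in extended attributes
on each copy. Illustrative only - the volume name and values below are
made up:

# viewed on the backend export, e.g.:
#   getfattr -d -m trusted.afr -e hex /export/brick1/somefile
trusted.afr.myvol-client-0=0x000000000000000000000000
trusted.afr.myvol-client-1=0x000000020000000000000000

A non-zero counter against a subvolume marks it as having pending
operations, which is how readers get steered to the "right" copy.)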
> > The intention of the above example is to give a general sense of
> > how we want to evolve the architecture (i.e., the "intention" you
> > were referring to) - keep the clients intelligent and the servers
> > dumb. If some intelligence needs to be built on the physical server,
> > tackle it by loading a client process there. (There are also
> > "pathinfo xattr" kind of internal techniques to figure out the
> > locality of clients in a generic way without bringing
> > "server-sidedness" into them in a harsh way.)
> Okay .. But what happened to the "brick" architecture
> of stacking anything on anything? I think you point
> that out here ...
> > > I'll write an rsyncd wrapper myself, to run on top of Gluster, if
> > > the intent is not to allow the configuration I'm after (an
> > > arbitrary number of disks in one multi-host environment replicated
> > > to an arbitrary number of disks in another multi-host environment,
> > > where ideally each environment need not sum to the same data
> > > capacity, presented as a single contiguous consumable storage
> > > layer to an arbitrary number of unintelligent clients, that is as
> > > tolerant as I choose it to be, including the ability to add and
> > > offline/online and remove storage as I so choose) .. or switch out
> > > the whole solution if Gluster is heading away from my needs. I
> > > just need to know what the direction is .. I may even be able to
> > > help get you there, if you tell me :)
> > >
> > >
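(Sketched as a volfile, the stack I'm after is replicate on top of two
distribute sets - one per environment - with no requirement that the
two sides have matching brick counts; names are hypothetical and the
protocol/client definitions are omitted:

volume site-a
  type cluster/distribute
  subvolumes a1 a2 a3
end-volume

volume site-b
  type cluster/distribute
  subvolumes b1 b2
end-volume

volume mirror
  type cluster/replicate
  subvolumes site-a site-b
end-volume
)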
> > There are good and bad in both styles (distribute on top v/s
> > replicate on top). Replicate on top gives you much better
> > flexibility of configuration. Distribute on top is easier for us
> > developers. As a user I would like replicate on top as well. But the
> > problem today is that replicate (and self-heal) does not understand
> > "partial failure" of its subvolumes. If one of the subvolumes of
> > replicate is a distribute, then today's replicate only understands
> > complete failure of the distribute set or it assumes everything is
> > completely fine. An example is self-healing of directory entries. If
> > a file is "missing" in one subvolume because a distribute node is
> > temporarily down, replicate has no clue why it is missing (or that
> > it should keep away from attempting to self-heal). Along the same
> > lines, it does not know that once a server is taken off from its
> > distribute subvolume for good, it needs to start recreating the
> > missing files.
> Hmm. I loved the brick idea. I don't like perverting it by trying to
> "see through" layers. In that context I can see two or three expected
> outcomes from someone building this type of stack (heh: a quick trick
> brick stack) - when a distribute child disappears;
> At the Distribute layer;
> 1) The distribute name space / stat space remains intact, though the
> content is obviously not available.
> 2) The distribute presentation is pure and true to its constituents,
> showing only the names / stats that are online/available.
> In the standalone case, 2 is probably preferable, as it allows
> capacity to be cleanly added/started/stopped/removed.
> At the Replicate layer;
> 3) replication occurs only where the name / stat space shows a gap.
> 4) replication occurs at any delta.
> I don't think there's a real choice here; even if 3 were sensible,
> what would replicate do if there was a local name and even just a
> remote file-size change, when there's no local content to update? It
> must be 4.
> In which case, I would expect that a replicate on top of a distribute
> with a missing child would suddenly see a delta that it would
> immediately set about repairing.
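To make that delta concrete: with one distribute child offline on one
side, and presentation style 2 in effect, replicate would compare
something like

  subvolume 1 (healthy distribute):        fileA  fileB  fileC
  subvolume 2 (distribute, child offline): fileA  fileB

see fileC as "missing" on subvolume 2, and set about recreating it
there - even though the real copy is only temporarily unreachable.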
> > The effort to fix this seems to be big enough to disturb the
> > inertia of the status quo. If this is fixed, we can definitely adopt
> > a "replicate on top" mode in glusterd.
> I'm not sure why there needs to be a "fix" .. wasn't
> the previous behaviour sensible?
> Or, if there is something to "change", then
> bolstering the distribute module might be enough -
> a combination of 1 and 2 above.
> Try this out: what if the Distribute layer maintained
> a full name space on each child, and didn't allow
> "recreation"? Say 3 children, one is broken/offline,
> so that /path/to/child/3/file is missing but is known
> to be missing (internally to Distribute). Then the
> Distribute brick can both not show the name
> space to the parent layers, but can also actively
> prevent manipulation of those files (the parent
> can neither stat /path/to/child/3/file nor unlink, nor
> create/write to it). If this change is meant to be
> permanent, then the administrative act of
> removing the child from distribute will then
> truncate the locked name space, allowing parents
> (be they users or other bricks, like Replicate) to
> act as they please (such as recreating the
> missing files).
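Purely as a sketch of the idea - this option does not exist in any
GlusterFS release that I know of - it might surface as something like:

volume dist
  type cluster/distribute
  # hypothetical option: retain the name space of offline children and
  # lock their entries (no stat/unlink/create) until the child is
  # administratively removed from the subvolumes list
  option lock-missing-namespace on
  subvolumes child1 child2 child3
end-volume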
> If you adhere to the principles that I thought I understood from 2009
> or so, then you should be able to let users create unforeseen Gluster
> architectures without fear or impact. I.e.
> i) each brick is fully self contained *
> ii) physical bricks are the bread of a brick stack sandwich **
> iii) any logical brick can appear above/below any other logical brick
> in a brick stack
> * Not mandating a 1:1 file mapping from layer to layer
> ** Eg: the Posix (bottom), Client (bottom), Server (top) and NFS
> (top) are all regarded as physical bricks.
> Thus it was my expectation that a dedupe brick (being logical) could
> go either above or below a distribute brick (also logical), for
> example. Or that an encryption brick could go on top of replicate,
> which was on top of encryption, which was on top of distribute, which
> was on top of encryption, on top of posix, for example.
> Or .. am I over simplifying the problem space?
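Taking principle iii literally, that last stack would be wired up
something like this (one leg shown; sibling subvolumes omitted, and
encryption/rot-13 - the old demo translator - standing in for a real
cipher):

volume store
  type storage/posix
  option directory /export/brick1
end-volume

volume crypt-low
  type encryption/rot-13
  subvolumes store
end-volume

volume dist
  type cluster/distribute
  subvolumes crypt-low
end-volume

volume crypt-mid
  type encryption/rot-13
  subvolumes dist
end-volume

volume repl
  type cluster/replicate
  subvolumes crypt-mid
end-volume

volume crypt-top
  type encryption/rot-13
  subvolumes repl
end-volume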
> Ian Latter
> Late night coder ..
Ian Latter
Late night coder ..