[Gluster-devel] ZkFarmer

Tue May 8 04:56:10 UTC 2012

On Mon, May 7, 2012 at 9:27 PM, Ian Latter <ian.latter at midnightcode.org> wrote:
>
>> > Is there anything written up on why you/all want every
>> > node to be completely conscious of every other node?
>> >
>> > I could see a couple of architectures that might work
>> > better (be more scalable) if the config minutiae were
>> > either not necessary to be shared or shared in only
>> > cases where the config minutiae were a dependency.
>>
>> Well, these aren't exactly minutiae.  Everything at file or directory
>> level is fully distributed and will remain so.  We're talking only
>> about stuff at the volume or server level, which is very little data
>> but very broad in scope.  Trying to segregate that only adds
>> complexity and subtracts convenience, compared to having it equally
>> accessible to (or through) any server.
>
> Sorry, I didn't have time this morning to add more detail.
>
> Note that my concern isn't bandwidth, it's flexibility; the
> less knowledge needed the more I can do crazy things
> in user land, like running boxes in different data centres
> and randomly power things up and down, randomly re-
> address, randomly replace in-box hardware, load
> balance, NAT, etc.  It makes a dynamic environment
> difficult to construct, for example, when Gluster rejects
> the same volume-id being presented to an existing
> cluster from a new GFID.
>
> But there's no need to go even that complicated, let
> me pull out an example of where shared knowledge
> may be unnecessary;
>
> The work that I was doing in Gluster (pre glusterd) drove
> out one primary "server" which fronted a Replicate
> volume of both its own Distribute volume and that of
> another server or two - themselves serving a single
> Distribute volume.  So the client connected to one
> server for one volume and the rest was black box /
> magic (from the client's perspective - big fast storage
> in many locations); in that case it could be said that
> servers needed some shared knowledge, while the
> clients didn't.
>
> The equivalent configuration in a glusterd world (from
> my experiments) pushed all of the distribute knowledge
> out to the client and I haven't had a response as to how
> to add a replicate on distributed volumes in this model,
> so I've lost replicate.  But in this world, the client must
> know about everything and the server is simply a set
> of served/presented disks (as volumes).  In this
> glusterd world, then, why does any server need to
> know of any other server, if the clients are doing all of
> the heavy lifting?
>
> The additional consideration is where the server both
> consumes and presents, but this would be captured in
> the client side view.  i.e. given where glusterd seems
> to be driving, this knowledge seems to be needed on
> the client side (within glusterfs, not glusterfsd).
>
> To my mind this breaks the gluster architecture that I
> read about 2009, but I need to stress that I didn't get
> a reply to the glusterd architecture question that I
> posted about a month ago;  so I don't know if glusterd
> is currently limiting deployment options because;
>  - there is an intention to drive the heavy lifting to the
>    client (for example for performance reasons in big
>    deployments), or;
>  - there are known limitations in the existing bricks/
>    modules (for example moving files thru distribute),
>    or;
>  - there is ultimately (long term) more flexibility seen
>    in this model (and we're at a midway point between
>    pre glusterd and post so it doesn't feel that way
>    yet), or;
>  - there is an intent to drive out a particular market
>    outcome or match an existing storage model (the
>    gluster presentation was driving towards cloud,
>    and maybe those vendors don't use server side
>    implementations), etc.
>
> As I don't have a clear/big picture in my mind; if I'm
> not considering all of the impacts, then my apologies.
>
>
>> > RE ZK, I have an issue with it not being a binary at
>> > the linux distribution level.  This is the reason I don't
>> > currently have Gluster's geo replication module in
>> > place ..
>>
>> What exactly is your objection to interpreted or JIT compiled
>> languages?  Performance?  Security?  It's an unusual position, to say
>> the least.
>>
>
> Specifically, primarily, space.  Saturn builds GlusterFS
> capacity from a 48 Megabyte Linux distribution and
> adding many Megabytes of Perl and/or Python and/or
> PHP and/or Java for a single script is impractical.
>
> My secondary concern is licensing (specifically in the
> Java run-time environment case).  Hadoop forced my
> hand; GNU's JRE/compiler wasn't up to the task of
> running Hadoop when I last looked at it (about 2 or 3
> years ago now) - well, it could run a 2007 or so
> version but not current ones at that time - so now I
> work with Gluster ..
>
>
>
> Going back to ZkFarmer;
>
> Considering other architectures; it depends on how
> you slice and dice the problem as to how much
> external support you need;
>  > I've long felt that our ways of dealing with cluster
>  > membership and staging of config changes is not
>  > quite as robust and scalable as we might want.
>
> By way of example;
>  The openMosix kernel extensions maintained their
> own information exchange between cluster nodes; if
> a node (ip) was added via the /proc interface, it was
> "in" the cluster.  Therefore cluster membership was
> the hand-off/interface.
>  It could be as simple as a text list on each node, or
> it could be left to a user space daemon which could
> then gate cluster membership - this suited everyone
> with a small cluster.
>  The native daemon (omdiscd) used multicast
> packets to find nodes and then stuff those IP's into
> the /proc interface - this suited everyone with a
> private/dedicated cluster.
>  A colleague and I wrote a TCP variation to allow
> multi-site discovery with SSH public key exchanges
> and IPSEC tunnel establishment as part of the
> gating process - this suited those with a distributed/
> part-time cluster.  To ZooKeeper's point
> (http://zookeeper.apache.org/), the discovery
> protocol that we created was weak and I've since
> found a model/algorithm that allows for far more
> robust discovery.
>
>  The point being that, depending on the final cluster
> architecture for gluster (i.e. all nodes are peers
> and thus all are cluster members, nodes are client
> or server and both are cluster members, nodes are
> client or server and only clients [or servers] are
> cluster members, etc) there may be simpler cluster
> management options ..
>
>
> Cheers,
>

The reason to keep the volume spec files on all servers is simply to be
fully distributed. No one node or set of nodes should hold the cluster
hostage. The code to keep them in sync over 2 nodes or 20 nodes is
essentially the same.

We are revisiting this situation now because we potentially want to
scale to thousands of nodes. Gluster CLI operations should not time out
or slow down at that size.

If ZK requires a proprietary JRE for stability, Java will be a no-no. We
may not need ZK at all. If we simply decide to centralize the config,
GlusterFS has enough code to handle it. Again, Avati will argue that
it is exactly the same code as now. My point is to keep things simple
as we scale. Even if the code base is the same, we should still restrict
it to N selected nodes. It is a matter of adding a config option.
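
To make "restrict it to N selected nodes" concrete, here is a minimal
sketch of one way it could work. It is not GlusterFS code: the function
names and the idea of ranking peers by a hash of their hostname are
assumptions made purely for illustration. Each server ranks the known
peers the same way and keeps the first N, so every server that sees the
same peer list picks the same config holders without any extra
coordination.

```python
import hashlib


def select_config_nodes(peers, n):
    """Pick n peers to hold the volume config (hypothetical helper).

    Peers are ranked by the SHA-1 of their name, so the result depends
    only on the set of peers, not on the order they were listed in.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    ranked = sorted(
        peers,
        key=lambda name: hashlib.sha1(name.encode("utf-8")).hexdigest(),
    )
    # If there are fewer than n peers, every peer holds the config.
    return ranked[:n]


def is_config_node(me, peers, n):
    """Return True if this server is one of the n config holders."""
    return me in select_config_nodes(peers, n)
```

With N set to the full peer count this degenerates to today's
behaviour (every server holds the config), so in this sketch N really
is just a config option on top of the same code path.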

-- 
Anand Babu Periasamy
Blog [ http://www.unlocksmith.org ]
Twitter [ http://twitter.com/abperiasamy ]

Imagination is more important than knowledge --Albert Einstein