[Gluster-devel] [Release-8] Thin-Arbiter: Unique-ID requirement

Amar Tumballi amar at kadalu.io
Tue Feb 4 17:37:35 UTC 2020

On Tue, Jan 14, 2020 at 2:37 PM Atin Mukherjee <atin.mukherjee83 at gmail.com> wrote:

> From a design perspective 2 is a better choice. However I'd like to see a
> design on how cluster id will be generated and maintained (with peer
> addition/deletion scenarios, node replacement etc).
Thanks for the feedback, Atin.

> On Tue, Jan 14, 2020 at 1:42 PM Amar Tumballi <amar at kadalu.io> wrote:
>> Hello,
>> As we are gearing up for Release-8, and its planning, I wanted to bring
>> up one of my favorite topics, 'Thin-Arbiter' (or Tie-Breaker/Metro-Cluster
>> etc etc).
>> We released thin-arbiter in v7.0 itself, and it works great when
>> we have just 1 gluster cluster. I am talking about a situation which
>> involves multiple gluster clusters, and easier management of thin-arbiter
>> nodes. (Ref: https://github.com/gluster/glusterfs/issues/763)
>> I am working with a goal of hosting a thin-arbiter node service (free of
>> cost), for which any gluster deployment can connect, and save their cost of
>> an additional replica, which is required today to not get into split-brain
>> situation. Tie-breaker storage and process needs are so small that we can
>> easily handle all gluster deployments to date on just one machine. When I
>> looked at the code with this goal, I found that the current implementation
>> doesn't support it, mainly because it uses 'volumename' in the file it
>> creates. This is good for 1 cluster, as we don't allow duplicate volume
>> names in a single cluster, and OK for multiple clusters only as long as
>> volume names do not collide.
>> To resolve this properly we have 2 options (as per my thinking now) to
>> make it truly global service.
>> 1. Add 'volume-id' option in afr volume itself, so, each instance picks
>> the volume-id and uses it in thin-arbiter name. A variant of this is
>> submitted for review - https://review.gluster.org/23723 but as it uses
>> volume-id from io-stats, this particular patch fails in case of brick-mux
>> and shd-mux scenarios.  A proper enhancement of this patch is, providing
>> 'volume-id' option in AFR itself, so glusterd (while generating volfiles)
>> sends the proper vol-id to instance.
>> Pros: Minimal code changes to the above patch.
>> Cons: One more option to AFR (not exposed to users).
>> 2. Add 'cluster-id' to glusterd, and pass it to all processes. Let
>> replicate use this in thin-arbiter file. This too will solve the issue.
>> Pros: A cluster-id is good to have in any distributed system, especially
>> when there are deployments of 3 nodes each in different clusters.
>> Identifying bricks and services as part of a cluster is better.
>> Cons: Code changes are more, and in the glusterd component.
>> On another note, 1 above is purely for the Thin-Arbiter feature, whereas
>> the 2nd option would be useful in debugging, and in other solutions which
>> involve multiple clusters.
>> Let me know what you all think about this. This is good to be discussed
>> in next week's meeting, and taken to completion.
After some more code reading, and thinking about possible solutions, I
found that there is another, simpler way to resolve this for
multiple clusters.

Currently, the thin-arbiter file name for a replica set is taken from the
3rd (i.e., index=2) entry of the 'pending-xattr' key in the volume file. If
we make that entry unique (say, volume-id + index-of-replica-set), this
problem is solved. It needs minimal change in glusterfs code (actually, no
code change in the filesystem part, only in glusterd-volgen.c).
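
To illustrate the idea, here is a minimal Python sketch (not the actual
glusterd-volgen.c change). It assumes the 'pending-xattr' value is a
comma-separated list; the helper names and the exact entry formats
(`<volname>-ta-<index>` and `ta-<volume-id>-<index>`) are hypothetical and
only meant to show why a volume-id based entry avoids collisions between
clusters.

```python
import uuid


def thin_arbiter_entry(pending_xattr: str) -> str:
    """Return the 3rd (index=2) entry of a comma-separated pending-xattr value."""
    entries = [e.strip() for e in pending_xattr.split(",")]
    if len(entries) < 3:
        raise ValueError("expected at least 3 entries in pending-xattr")
    return entries[2]


def pending_xattr_by_name(volname: str, replica_set_index: int) -> str:
    """Hypothetical format: the thin-arbiter entry is built from the volume name."""
    ta = f"{volname}-ta-{replica_set_index}"
    return f"{volname}-client-0,{volname}-client-1,{ta}"


def pending_xattr_by_id(volname: str, volume_id: uuid.UUID,
                        replica_set_index: int) -> str:
    """Hypothetical format: the thin-arbiter entry is volume-id + replica-set index."""
    ta = f"ta-{volume_id}-{replica_set_index}"
    return f"{volname}-client-0,{volname}-client-1,{ta}"


if __name__ == "__main__":
    # Two independent clusters that both happen to have a volume named "vol1".
    id_a = uuid.UUID("11111111-1111-1111-1111-111111111111")
    id_b = uuid.UUID("22222222-2222-2222-2222-222222222222")

    # Name-based entries are identical, so both clusters would use one file.
    print(thin_arbiter_entry(pending_xattr_by_name("vol1", 0)))
    print(thin_arbiter_entry(pending_xattr_by_name("vol1", 0)))

    # Volume-id based entries differ per cluster and per replica set.
    print(thin_arbiter_entry(pending_xattr_by_id("vol1", id_a, 0)))
    print(thin_arbiter_entry(pending_xattr_by_id("vol1", id_b, 0)))
```

With name-based entries, both clusters end up pointing at the same file on a
shared thin-arbiter node. With volume-id based entries they do not, as long
as the volume-ids differ.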

I tried this approach while implementing the replica2 option
<https://kadalu.io/rfcs/0003-kadalu-thin-arbiter-support> of the kadalu.io
project. The tests are running fine, and the expected goal is met.


>  I am working with a goal of hosting a thin-arbiter node service (free of
> cost), for which any gluster deployment can connect, and save their cost of
> an additional replica, which is required today to not get into split-brain
> situation.


I am happy to report that this goal is achieved. We now have
`tie-breaker.kadalu.io:/mnt`, an instance in the cloud, for anyone who wants
to use a thin-arbiter. If you are not keen to deploy your own instance, you
can use this as your thin-arbiter instance. Note that if you are using
glusterfs releases, you may want to wait for patch
https://review.gluster.org/24096 to make it into a release (probably
7.3/7.4) before using this in production. Until then, volume files generated
by glusterd volgen still use the volume name itself in pending-xattr, so
files from different clusters can collide.


>> Regards,
>> Amar
>> ---
>> https://kadalu.io
>> Storage made easy for k8s

Container Storage made easy!