[Gluster-users] Replication logic

Sun Dec 27 02:00:33 UTC 2020

> Merry Christmas!

To you too :)

>> I have set up a replica 3 arbiter 1 volume. Is there a way to turn
>> the arbiter into a full replica without breaking the volume and
>> losing the metadata that is already on the arbiter?

> Yes, you have to use "remove-brick" with the option "replica" to reduce 
> the replica count and then reformat the arbiter brick and add it back.

But if I do that, the metadata that is already on the brick will be
lost. What I was asking is whether there is a way to "upgrade" the
arbiter to a full replica without losing the metadata in the meantime.

You might ask why it matters: if the data needs to be replicated
to the ex-arbiter brick anyway, rebuilding the metadata as well is
only a very slight overhead. True, but if the metadata on the ex-arbiter
remains intact, any one other brick can go down while the ex-arbiter
is building up its datastore and the volume will still have quorum.
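
To make that argument concrete, here is a toy sketch in Python. It
only models my reasoning, not how gluster actually decides quorum:
the rule "a brick votes if it is online and still holds valid
metadata", the function name and the brick names are all assumptions
made up for illustration.

```python
# Toy model of the quorum argument above; NOT gluster's actual AFR logic.
# Assumption: a brick counts towards quorum only if it is online and
# still holds valid metadata for the volume.

def has_quorum(bricks, total=3):
    """Return True if a strict majority of the bricks can vote."""
    voters = sum(
        1 for b in bricks.values() if b["online"] and b["has_metadata"]
    )
    return voters > total // 2

# Case A: the ex-arbiter keeps its metadata while its data is being built up.
kept = {
    "brick1": {"online": True, "has_metadata": True},
    "brick2": {"online": False, "has_metadata": True},  # goes down mid-heal
    "ex_arbiter": {"online": True, "has_metadata": True},
}

# Case B: the ex-arbiter was reformatted and starts out empty.
wiped = {
    "brick1": {"online": True, "has_metadata": True},
    "brick2": {"online": False, "has_metadata": True},  # goes down mid-heal
    "ex_arbiter": {"online": True, "has_metadata": False},
}

quorum_kept = has_quorum(kept)    # two of three bricks can vote
quorum_wiped = has_quorum(wiped)  # only one of three bricks can vote
```

Under these assumptions the volume survives the loss of brick2 only
in case A, which is exactly why I would like to keep the metadata.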

>> where brick2<->brick3 is a high-speed connection, but brick1<->brick2
>> and brick1<->brick3 are low speed, and data is fed to brick1, is there
>> a way to tell the volume that brick1 should only feed brick2 and let
>> brick2 feed brick3 if (and only if) all three are online, rather than
>> brick1 feeding both brick2 and brick3?

> Erm... this is not how it works. The FUSE client (mount -t glusterfs) 
> is writing to all bricks in the replica volume, not the brick to brick. 

Aha, it's the client writing to the bricks and not the server?  That's
the part that I had not understood.

> What are you trying to achieve ? What is your setup (clients,servers,etc) ?

The goal: a resilient and geographically distributed mailstore. A
mail server is a very dynamic thing, with files being written, moved
and deleted all the time. You can put the mailstore on a SAN and
access it from multiple SMTP and IMAP servers, but if the SAN goes
down, everything is down. What I am trying to do is to distribute
the mailstore over several locations and internet connections that
function completely independently of each other.

Now you might think georeplication, but that won't work for a mailstore
(a) because georeplication is asynchronous, so if mailserver1 suddenly
goes down and mailserver2 takes over, there will be mail on mailserver1
that is still missing on mailserver2 and will remain missing until
mailserver1 comes back up again, and (b) because georeplication (if
I have understood the docs correctly) only works in one direction,
so that any mail that arrives on a downstream replica will never be
propagated to its upstream replicas.

That's why I'm using a normal synchronous replica, currently experimenting
and testing with replica 3 arbiter 1. If and when this goes into production,
I want to get rid of the arbiter and have three full replicas.

There are three machines running gluster 8.3 and only using gluster
as the client (mount -t glusterfs) without nfs or anything else. One
machine is in Stockholm, one is in Athens and one in Frankfurt a/M,
though the latter will eventually migrate to Buenos Aires. That's
a lot of latency, and on top of that the Athens connection is very slow.
That's why I asked whether I could configure brick1 (where the data
is now coming in) to only write to brick2 and let brick2 write to
brick3.
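
A rough sketch in Python of what I was hoping to save. It uses the
link layout from my earlier question (brick1<->brick2 and
brick1<->brick3 slow, brick2<->brick3 fast) and simply counts the
bytes that cross slow links. The message size and all names are
made up, and the "chain" mode is only what I was asking for, not
something gluster offers as far as I know.

```python
# Link classes as described in the earlier question; the names are hypothetical.
LINKS = {
    frozenset({"brick1", "brick2"}): "slow",
    frozenset({"brick1", "brick3"}): "slow",
    frozenset({"brick2", "brick3"}): "fast",
}

def slow_link_bytes(hops, size):
    """Sum the bytes that cross slow links for a list of (src, dst) hops."""
    return sum(
        size for src, dst in hops if LINKS[frozenset({src, dst})] == "slow"
    )

MAIL_SIZE = 1_000_000  # bytes, an invented message size

# Client-side replication: the client on brick1's machine writes to every
# brick itself (the write to the local brick needs no network hop).
fan_out = [("brick1", "brick2"), ("brick1", "brick3")]

# The chained scheme I had in mind: brick1 feeds brick2, brick2 feeds brick3.
chain = [("brick1", "brick2"), ("brick2", "brick3")]

fan_out_slow = slow_link_bytes(fan_out, MAIL_SIZE)
chain_slow = slow_link_bytes(chain, MAIL_SIZE)
```

In this toy model the fan-out pushes every message over a slow link
twice, while the chain would push it over a slow link only once.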

Now, I've read the docs and I know very well that I am doing things
way out of "the normal way", but I am willing to trade performance
for resiliency on the mail server, so if I can get that distributed
mailstore to work somewhat properly, I don't care at all if new mail
takes 15 minutes to propagate to the slow Athens node. What's
important is that all three nodes are perfectly synchronised and that
mail continues to work seamlessly if any one of them goes down[1].

Z

[1] Beyond the scope of gluster: with synchronous replication, if mail
is being delivered to one node, it won't be finally accepted by the
mail server until it has also been written to the other online nodes.
This means that if the receiving node goes down or the volume gets out
of quorum before the incoming mail is everywhere on the volume, the
sending mail server will never get an acknowledgement of receipt and
will therefore try to resend the mail later. Thus, if all nodes are
advertised as MX in DNS, the mail will be resent to another node five
minutes later.
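
The retry behaviour in [1] can be sketched as a toy simulation in
Python. This is only an illustration under my own assumptions: a
node acknowledges a mail only if it is up and a majority of the
three nodes is online, and the sender simply tries the next MX on
its next attempt. The function names are invented, and real MTAs
and gluster are of course more involved.

```python
def try_delivery(node, online, total=3):
    """Assumed rule: accept only if the node is up and the volume has quorum."""
    return node in online and len(online) > total // 2

def send_with_retries(attempts):
    """attempts: list of (target_node, set_of_online_nodes), one per try."""
    for number, (node, online) in enumerate(attempts, start=1):
        if try_delivery(node, online):
            return node, number
    return None, len(attempts)

attempts = [
    # First try: the receiving node is down, so the sender gets no ack.
    ("stockholm", {"athens", "frankfurt"}),
    # Retry against another MX: node is up and two of three are online.
    ("athens", {"athens", "frankfurt"}),
]

accepted_by, tries = send_with_retries(attempts)
```

Here the mail is never acknowledged by the node that went down and
ends up accepted by another node on the second attempt.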