[Gluster-users] Replication logic

Sun Dec 27 11:31:00 UTC 2020

>But if I do that, the metadata that are already on the brick will be
>lost. What I was asking, is whether there is a way to "upgrade" the
>arbiter to a full replica without losing the metadata in the meanwhile.

You have a 'replica 3 arbiter 1' volume. When you want to replace the arbiter you will need to do it in several steps:
1) use remove-brick to get rid of the arbiter like this:
gluster volume remove-brick VOLUME replica 2 arbiter:/path/to/brick

The command reduces the volume from 'replica 3 arbiter 1' to a plain 'replica 2'. The 2 data bricks remain in place and keep running.

2) Reusing the brick is easiest if you just umount it, wipe the fs and recreate it. It's far simpler:
umount /dev/VG/arbiter-brick
mkfs.xfs -f -i size=512 /dev/VG/arbiter-brick
mount /dev/VG/arbiter-brick
mkdir </path/to/lv/mountpoint>/brick

3) Add the recreated brick
gluster volume add-brick VOLUME replica 3 arbiter:/path/to/lv/mountpoint/brick

4) force a heal
gluster volume heal VOLUME full
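
To see when the heal from step 4 has finished, the pending entries reported by 'gluster volume heal VOLUME info' can be summed up. This is only a minimal sketch: it assumes the usual "Number of entries: N" line per brick in that output, and the function name pending_heal_entries is made up for illustration.

```shell
# Sum the "Number of entries:" lines that 'heal info' prints for each brick.
# A brick that is down reports "-" instead of a number; awk treats that as 0,
# so also check 'gluster volume status' before trusting a result of 0.
pending_heal_entries() {
    gluster volume heal "$1" info |
        awk '/^Number of entries:/ { sum += $NF } END { print sum + 0 }'
}

# Example (run on one of the Gluster nodes):
#   pending_heal_entries VOLUME
```

A result of 0 with all bricks connected suggests the new brick has caught up.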

>You might ask, why does it matter? If the data needs to be replicated
>to the ex-arbiter brick anyway, also rebuilding the metadata is only
>a very slight overhead. Yes, but if the metadata on the ex-arbiter
>remains intact, any one other brick can go down while the ex-arbiter
>is building up its datastore and the volume will still have quorum.

The arbiter holds only metadata, but it is useful to have it running. Yet in both cases (remove-brick + add-brick, or replace-brick) there is a window in which some files/dirs won't have their metadata on the arbiter. You have to take that risk. And you always have the option to set the quorum statically to "1", so that even in replica 2 the surviving node keeps serving requests from the clients.
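
Setting the quorum statically could look like the sketch below. It assumes that the client-quorum options cluster.quorum-type and cluster.quorum-count are the ones to use here; the function names are hypothetical. Treat a quorum of 1 as a temporary measure, since it raises the risk of split-brain, and switch back to 'auto' once all bricks are in sync again.

```shell
# Let a single surviving brick keep accepting writes (temporary measure).
quorum_fixed_one() {
    gluster volume set "$1" cluster.quorum-type fixed &&
    gluster volume set "$1" cluster.quorum-count 1
}

# Return to majority-based quorum once the bricks are healthy again.
quorum_auto() {
    gluster volume set "$1" cluster.quorum-type auto
}

# Example:
#   quorum_fixed_one VOLUME
#   ... repair or replace the failed brick, wait for the heal ...
#   quorum_auto VOLUME
```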

>Aha, it's the client writing to the bricks and not the server?  That's
>the part that I had not understood.
What you described is the NFS xlator (the old legacy gNFS, which is disabled by default, but you can recompile to get it), yet the NFS xlator will also try to replicate to all nodes in the cluster simultaneously.

> What are you trying to achieve ? What is your setup (clients,servers,etc) ?

>Now you might think georeplication, but that won't work for a mailstore
>(a) because georeplication is asynchronous, so if mailserver1 suddenly
>goes down and mailserver2 takes over, there will be mail on mailserver1
>that is still missing on mailserver2 and will remain missing until
>mailserver1 comes back up again, and (b) because georeplication (if
>I have understood the docs correctly) only works in one direction,
>so that any mail that arrives on a downstream replica will never be
>propagated to its upstream replicas.

Geo-replication is not that slow. In my experience, with the default settings it syncs quite often. I understand that it is a problem if a mail is missing because Node1 died before the replication had time to distribute it. Also keep in mind that secondary volumes (a.k.a. slave volumes) are in read-only mode by default... just mentioning it.

>I want to get rid of the arbiter and have three full replicas.
You have 2 options: remove-brick + add-brick, or the old-school "replace-brick". In both cases there is a window in which the new brick is still receiving data, and if one of the old "data" bricks fails during it, you have to change the quorum to "1" until you fix the issue.

>There are three machines running gluster 8.3 and only using gluster
>as the client (mount -t glusterfs) without nfs or anything else. 
If a node is both a Gluster server and an application host, we call it a hyperconverged setup. That is quite a typical usage.

>One machine is in Stockholm, one is in Athens and one in Frankfurt a/M,
>though the latter will eventually migrate to Buenos Aires. That's
>a lot of latency and then the Athens connection is also very slow.
That's a lot of latency and a bandwidth restriction. With a regular replica volume the performance will be quite limited. Reads happen locally (if you use the default value for the "cluster.choose-local" option), but writes go to all bricks and are confirmed only when all bricks confirm the FOP (file operation) or time out. I'm not sure if there is an option that limits the FUSE operation timeout without touching "network.ping-timeout".
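
Before changing anything, it may help to look at what those two options are currently set to. A minimal sketch using 'gluster volume get'; the function name show_latency_options is hypothetical:

```shell
# Print the current values of the two options mentioned above.
show_latency_options() {
    for opt in cluster.choose-local network.ping-timeout; do
        gluster volume get "$1" "$opt"
    done
}

# Example:
#   show_latency_options VOLUME
```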

>Now, I've read the docs and I know very well that I am doing things
>way out of "the normal way", but I am willing to trade performance
>for resiliency on the mail server, so if I can get that distributed
>mailstore to work somewhat properly, I don't care at all if new mail
>takes 15 minutes to propagate to the slow Athens node. What's
>important is that all three nodes are perfectly synchronised and that
>mail continues to work seamlessly if any one of them goes down[1].

Erm... with a 'replica 3' volume and slow bandwidth to Athens, you might have to check your mail server's timeouts (and bump them), as it might get stuck in "D" state (waiting for I/O) while writing an e-mail.
Most probably every write will be slower than usual, but reads should not be affected.
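
To check whether processes really end up in "D" state during slow writes, something like the following can be run on the mail server. It is a rough sketch that relies on the STAT column of a procps-style 'ps'; the function name list_d_state is made up for illustration.

```shell
# List processes in uninterruptible sleep ("D" state), e.g. waiting for I/O.
# NR > 1 skips the header line; $2 is the STAT column.
list_d_state() {
    ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/'
}

# Example: run it a few times while mail is being delivered
#   list_d_state
```

If the mail daemon shows up there for long periods, its timeouts are probably too tight for this setup.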

You are definitely out of the "normal", but if performance is not the highest priority, it should work.

Best Regards,
Strahil Nikolov