[Gluster-users] Adding arbiter on a large existing replica 2 set

Mon Oct 21 13:23:03 UTC 2019

Hi,

The new cluster is set up with two physical servers with HDDs and a VM backed by an all-flash stretched vSAN.
The old cluster will be set up the same way.

The main volume that I'm concerned about usually takes about 20-30 minutes to finish the self-heal after a node reboot; the network is 10Gbps.

Best regards
--
THORGEIR MARTHINUSSEN
Senior Systems Consultant
BASEFARM

-----Original Message-----
From: Strahil <hunter86_bg at yahoo.com>
To: Thorgeir <thorgeir.marthinussen at basefarm.com>, gluster-users <gluster-users at gluster.org>
Subject: Re: [Gluster-users] Adding arbiter on a large existing replica 2 set
Date: Wed, 16 Oct 2019 21:04:50 +0300

Hi Thorgeir,

Did you try adding an arbiter with SSD brick/bricks?

SSD/NVMe is the best type of storage for an arbiter. Yes, it's more expensive, but you will need fewer disks than for a data brick.

Of course, the arbiter is only one side of the equation, and the time to heal might depend on your data bricks' IOPS.

How much time does a node in the cluster need to heal after being rebooted?

Best Regards,
Strahil Nikolov

On Oct 16, 2019 16:37, Thorgeir Marthinussen <thorgeir.marthinussen at basefarm.com> wrote:
Hi,

We have an old Gluster cluster setup, running a replica 2 across two datacenters, currently on version 4.1.5.

I need to add an arbiter to this setup, but I'm concerned about the performance impact of this on the volumes.

I recently set up a new cluster, for a different purpose, and decided to test adding an arbiter to the volume after adding in some data.
The volume had ~435,000 files totaling about 12TB.
Adding the arbiter initiated a heal-operation that took almost 3 hours.

On the older cluster, one of the volumes is about 14TB but holds ~45.5 million files.

Since the arbiter is only concerned with metadata and checksums, I'm worried that we have about 100 times the number of files, i.e. 100 times the number of I/O operations to execute during healing, and possibly 100 times the heal time, which would mean about 12.5 days.
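
For reference, the estimate above can be written out as a small calculation. This is only a sketch: it assumes heal time scales linearly with the number of files, which the test on the new cluster does not prove, and the function name is made up for illustration. With the unrounded file ratio the result comes out slightly above the 12.5 days quoted.

```python
# Figures taken from the message: test volume on the new cluster
# and the large volume on the old cluster.
TEST_FILES = 435_000
TEST_HEAL_HOURS = 3.0
OLD_VOLUME_FILES = 45_500_000


def estimate_heal_hours(files, ref_files=TEST_FILES, ref_hours=TEST_HEAL_HOURS):
    """Scale the reference heal time linearly with file count.

    Linear scaling is an assumption, not a measured property of the heal.
    """
    return ref_hours * files / ref_files


ratio = OLD_VOLUME_FILES / TEST_FILES
hours = estimate_heal_hours(OLD_VOLUME_FILES)
print(f"file ratio: {ratio:.1f}x")
print(f"estimated heal time: {hours:.0f} hours ({hours / 24:.1f} days)")
```

The rounded "100 times" figure corresponds to 43.5 million files and gives exactly 300 hours, i.e. 12.5 days.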

Another "issue" is that the 'gluster volume heal <vol-name> info summary' command seems to "count" all the files, so the command can take a very long time to complete.
The metrics-scraping script I created for us, with a timeout of 110 seconds, fails to complete when a volume has over ~800-900 files unsynced (which happens regularly when taking one cluster node down for patching).
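
To illustrate the kind of scraping described above, here is a minimal sketch, not the actual script. It wraps the same 'gluster volume heal <vol-name> info summary' command in a hard timeout and reports "no data" instead of hanging. The function names are hypothetical, and the "Number of entries in heal pending:" line it parses is an assumption about the summary output that should be checked against what your version prints.

```python
import re
import subprocess

# Assumed per-brick line in the summary output; verify on your version.
PENDING_RE = re.compile(
    r"^Number of entries in heal pending:\s*(\d+)\s*$", re.MULTILINE
)


def parse_heal_pending(summary_text):
    """Sum the per-brick pending counts; None if no such line is present."""
    counts = [int(n) for n in PENDING_RE.findall(summary_text)]
    return sum(counts) if counts else None


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its stdout, or None on timeout or failure."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError,
            FileNotFoundError):
        return None
    return result.stdout


def heal_pending(volume, timeout_s=110):
    """Pending-heal count for a volume, or None if the command did not finish."""
    out = run_with_timeout(
        ["gluster", "volume", "heal", volume, "info", "summary"], timeout_s
    )
    return None if out is None else parse_heal_pending(out)
```

Returning None on a timeout lets the metrics side distinguish "command did not finish" from "zero entries pending", so a slow summary shows up as a gap rather than a false zero.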

Does anyone have any experience with adding an arbiter afterwards: performance impact, time to heal, etc.?
I'm also interested in other ways to get the status of healing.

Any advice would be appreciated.

Best regards
--
THORGEIR MARTHINUSSEN
Senior Systems Consultant
BASEFARM