<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>Yes,</p>
<p>this makes a lot of sense.</p>
<p>It's the behavior that I was experiencing that makes no sense.</p>
<p>When one node was shut down, the whole VM cluster locked up.</p>
<p>However, I managed to find that the culprit was the quorum
settings.</p>
<p>I have now set the quorum to 2 bricks, and I am not
experiencing the problem anymore.</p>
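<p>For reference, a sketch of that kind of change, assuming
client-side quorum on a volume named vmstore (both the volume name
and the exact options are my assumptions):</p>
<pre># require 2 of the 3 bricks up before allowing writes
gluster volume set vmstore cluster.quorum-type fixed
gluster volume set vmstore cluster.quorum-count 2</pre>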
<p>All my vm boot disks and data disks are now sharded.</p>
<p>We are on 10 Gbit networking; when the node comes back, we do
not really see any latency.</p>
<p><br>
</p>
<p>Carl</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 2019-08-29 3:58 p.m., Darrell Budic
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:WM!3112107ad7c62ddc655bdfed0388a6c03594cfa91c62e11538e9381642474ddc4672c6062aa1ced2e121a007bbbb1734!@filter1.lastspam.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
You may be misunderstanding the way the gluster system works in
detail here, but you’ve got the right idea overall. Since gluster
is maintaining 3 copies of your data, you can lose a drive or a
whole system and things will keep going without interruption
(well, mostly: if a host node was using the system that just died,
it may pause briefly before re-connecting to one that is still
running via a backup-server setting or your dns configs). While
the system is still going with one node down, that node is falling
behind on new disk writes, and the remaining ones are keeping
track of what’s changing. Once you repair/recover/reboot the down
node, it will rejoin the cluster. Now the recovered system has to
catch up, and it does this by having the other two nodes send it
the changes. In the meantime, gluster is serving any reads for
that data from one of the up-to-date nodes, even if you ask the
one you just restarted. In order to do this healing, it has to
lock the files to ensure no changes are made while it copies a
chunk of them over to the recovered node. When it locks them, your
hypervisor notices they have gone read-only and, especially if it
has a pending write for that file, may pause the VM because this
looks like a storage issue to it. Once the file gets unlocked, it
can be written again, and your hypervisor notices and will
generally reactivate your VM. You may see delays too, especially
if you only have 1G networking between your host nodes while
everything is getting copied around. And your files may be locked,
updated, unlocked, then locked again a few seconds or minutes
later, and so on.
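<div class=""><br class="">
</div>
<div class="">(If you want to watch a heal in progress, one way is
the heal info commands; the volume name here is illustrative:)</div>
<pre class="">gluster volume heal vmstore info                    # entries still pending heal, per brick
gluster volume heal vmstore statistics heal-count   # just the counts</pre>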
<div class=""><br class="">
</div>
<div class="">That’s where sharding comes into play: once you have
a file broken up into shards, gluster can get away with only
locking the particular shard it needs to heal, leaving the
whole disk image unlocked. You may still catch a brief pause if
you try to write the specific segment of the file gluster is
healing at the moment, but it’s also going to be much faster
because it’s a small chunk of the file, and copies quickly.<br
class="">
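<div class=""><br class="">
</div>
<div class="">(Turning sharding on is just two settings, e.g. with
the default shard size; note it only affects files created after it
is enabled:)</div>
<pre class="">gluster volume set vmstore features.shard on
gluster volume set vmstore features.shard-block-size 64MB   # 64MB is the default</pre>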
<div class=""><br class="">
</div>
<div class="">Also, check out <a
href="https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/server-quorum/"
class="" moz-do-not-send="true">https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/server-quorum/</a>;
you probably want to set cluster.server-quorum-ratio to 50 for
a replica-3 setup to avoid the possibility of split-brains.
Your cluster will go read-only if it loses two nodes, though
you can always change the server-quorum-ratio
later if you need to keep it running temporarily.<br class="">
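<div class=""><br class="">
</div>
<div class="">(That ratio is set cluster-wide rather than per
volume, something like:)</div>
<pre class="">gluster volume set all cluster.server-quorum-ratio 50%</pre>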
<div><br class="">
</div>
<div>Hope that makes sense of what’s going on for you,</div>
<div><br class="">
</div>
<div> -Darrell</div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 23, 2019, at 5:06 PM, Carl Sirotic
<<a href="mailto:csirotic@evoqarchitecture.com"
class="" moz-do-not-send="true">csirotic@evoqarchitecture.com</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<meta http-equiv="Content-Type" content="text/html;
charset=UTF-8" class="">
<div text="#000000" bgcolor="#FFFFFF" class="">
<p class="">Okay,</p>
<p class="">so that means I am at least not getting the
expected behavior, and there is hope.</p>
<p class="">I applied the quorum settings that I was told
about a couple of emails ago.</p>
<p class="">After applying virt group, they are</p>
<pre class="">cluster.quorum-type          auto
cluster.quorum-count         (null)
cluster.server-quorum-type   server
cluster.server-quorum-ratio  0
cluster.quorum-reads         no</pre>
<p class="">Also,</p>
<p class="">I have just set the ping timeout to 5 seconds.</p>
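<p class="">That is, roughly (volume name is my assumption):</p>
<pre class="">gluster volume set vmstore network.ping-timeout 5</pre>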
<p class=""><br class="">
Carl<br class="">
</p>
<div class="moz-cite-prefix">On 2019-08-23 5:45 p.m.,
Ingo Fischer wrote:<br class="">
</div>
<blockquote type="cite"
cite="mid:WM!70be0689af39e9a9c31a03ebbb8526a85e48659c6f910db0e14f7159e759eff604955b78624896a3e47e88cb0eb836d0!@filter1.lastspam.com"
class="">
<meta http-equiv="content-type" content="text/html;
charset=UTF-8" class="">
Hi Carl,
<div class=""><br class="">
</div>
<div class="">In my understanding and experience (I
have a replica 3 system running too) this should
not happen. Can you share your client and server
quorum settings?<br class="">
<br class="">
<div dir="ltr" class="">Ingo</div>
<div dir="ltr" class=""><br class="">
On 23.08.2019 at 15:53, Carl Sirotic <<a
href="mailto:csirotic@evoqarchitecture.com"
moz-do-not-send="true" class="">csirotic@evoqarchitecture.com</a>> wrote:<br
class="">
<br class="">
</div>
<blockquote type="cite" class="">
<div dir="ltr" class="">
<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8" class="">
<p class="">However,</p>
<p class="">I must have misunderstood the
whole concept of gluster.</p>
<p class="">In a replica 3, for me, it's
completely unacceptable, regardless of the
options, that all my VMs go down when I
reboot one node.</p>
<p class="">The whole point of keeping three full
copies of my data on the fly is supposed to
be exactly this.</p>
<p class="">I am in the process of sharding
every file.</p>
<p class="">But even if the healing time were
longer, I would still expect a
non-sharded replica 3 volume holding VM boot
disks not to go down when I reboot one of its
copies.</p>
<p class=""><br class="">
</p>
<p class="">I am not very impressed by gluster
so far.<br class="">
</p>
<p class="">Carl<br class="">
</p>
<div class="moz-cite-prefix">On 2019-08-19
4:15 p.m., Darrell Budic wrote:<br class="">
</div>
<blockquote type="cite"
cite="mid:WM!70b2a24c324753176289e0b250d790e7f5ffa931f81e9072ffcc23c4f6fc1a7199617ab90bbd5e0e5170f02e1339ca54!@filter4.lastspam.com"
class="">
<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8"
class="">
/var/lib/glusterd/groups/virt is a good
start for ideas, notably some thread
settings and choose-local=off to improve
read performance. If you don’t have at least
10 cores on your servers, you may want to
lower the recommended shd-max-threads=8 to
no more than half your CPU cores to keep
healing from swamping out regular work.
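<div class=""><br class="">
</div>
<div class="">For example (volume name illustrative; pick a thread
count to suit your core count):</div>
<pre class="">gluster volume set vmstore group virt                  # applies the whole virt profile at once
gluster volume set vmstore cluster.shd-max-threads 4   # e.g. half the cores of an 8-core host</pre>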
<div class=""><br class="">
</div>
<div class="">It’s also starting to depend
on what your backing store and networking
setup are, so you’re going to want to test
changes and find what works best for your
setup.</div>
<div class=""><br class="">
</div>
<div class="">In addition to the virt group
settings, I use these on most of my
volumes, SSD or HDD backed, with the
default 64M shard size:</div>
<div class=""><br class="">
</div>
<div class="">
<pre class="">performance.io-thread-count: 32         # seemed good for my system, particularly a ZFS backed volume with lots of spindles
client.event-threads: 8
cluster.data-self-heal-algorithm: full  # 10G networking, uses more net/less cpu to heal. probably don’t use this for 1G networking?
performance.stat-prefetch: on
cluster.read-hash-mode: 3               # distribute reads to least loaded server (by read queue depth)</pre>
<div class=""><br class="">
</div>
<div class="">and these two only on my HDD backed volume:</div>
<div class=""><br class="">
</div>
<pre class="">performance.cache-size: 1G
performance.write-behind-window-size: 64MB</pre>
<div class=""><br class="">
</div>
<div class="">but I suspect these two need another round or six of
tuning to tell if they are making a difference.</div>
</div>
<div class=""><br class="">
</div>
<div class="">I use the
throughput-performance tuned profile on my
servers, so you should be in good shape
there.</div>
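<div class=""><br class="">
</div>
<div class="">(Switching is a single command if a box is not on it
already:)</div>
<pre class="">tuned-adm profile throughput-performance
tuned-adm active    # confirm the active profile</pre>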
<div class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 19, 2019, at
12:22 PM, Guy Boisvert <<a
href="mailto:guy.boisvert@ingtegration.com"
class="" moz-do-not-send="true">guy.boisvert@ingtegration.com</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">On 2019-08-19 12:08
p.m., Darrell Budic wrote:<br
class="">
<blockquote type="cite" class="">You
also need to make sure your
volume is set up properly for
best performance. Did you apply
the gluster virt group to your
volumes, or at least
features.shard = on on your VM
volume?<br class="">
</blockquote>
<br class="">
That's what we did here:<br
class="">
<br class="">
<br class="">
<pre class="">gluster volume set W2K16_Rhenium cluster.quorum-type auto
gluster volume set W2K16_Rhenium network.ping-timeout 10
gluster volume set W2K16_Rhenium auth.allow \*
gluster volume set W2K16_Rhenium group virt
gluster volume set W2K16_Rhenium storage.owner-uid 36
gluster volume set W2K16_Rhenium storage.owner-gid 36
gluster volume set W2K16_Rhenium features.shard on
gluster volume set W2K16_Rhenium features.shard-block-size 256MB
gluster volume set W2K16_Rhenium cluster.data-self-heal-algorithm full
gluster volume set W2K16_Rhenium performance.low-prio-threads 32

tuned-adm profile random-io     # a profile I added in CentOS 7

cat /usr/lib/tuned/random-io/tuned.conf
===========================================
[main]
summary=Optimize for Gluster virtual machine storage
include=throughput-performance

[sysctl]

vm.dirty_ratio = 5
vm.dirty_background_ratio = 2</pre>
<br class="">
<br class="">
Any more optimization to add to
this?<br class="">
<br class="">
<br class="">
Guy<br class="">
<br class="">
-- <br class="">
Guy Boisvert, ing.<br class="">
IngTegration inc.<br class="">
<a
href="http://www.ingtegration.com/"
class="" moz-do-not-send="true">http://www.ingtegration.com</a><br
class="">
<a class="moz-txt-link-freetext"
href="https://www.linkedin.com/in/guy-boisvert-8990487"
moz-do-not-send="true">https://www.linkedin.com/in/guy-boisvert-8990487</a><br
class="">
<br class="">
CONFIDENTIALITY NOTICE :
Proprietary/Confidential
Information<br class="">
belonging to IngTegration Inc. and
its affiliates may be<br class="">
contained in this message. If you
are not a recipient<br class="">
indicated or intended in this
message (or responsible for<br
class="">
delivery of this message to such
person), or you think for<br
class="">
any reason that this message may
have been addressed to you<br
class="">
in error, you may not use or copy
or deliver this message to<br
class="">
anyone else. In such case, you
should destroy this message<br
class="">
and are asked to notify the sender
by reply email.<br class="">
<br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</blockquote>
</div>
</blockquote>
<blockquote type="cite" class="">
<div dir="ltr" class=""><span class="">_______________________________________________</span><br
class="">
<span class="">Gluster-users mailing list</span><br
class="">
<span class=""><a
href="mailto:Gluster-users@gluster.org"
moz-do-not-send="true" class="">Gluster-users@gluster.org</a></span><br
class="">
<span class=""><a
href="https://lists.gluster.org/mailman/listinfo/gluster-users"
moz-do-not-send="true" class="">https://lists.gluster.org/mailman/listinfo/gluster-users</a></span></div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote>
</body>
</html>