You may be misunderstanding the way the gluster system works in detail here, but you’ve got the right idea overall. Since gluster is maintaining 3 copies of your data, you can lose a drive or a whole system and things will keep going without interruption (well, mostly: if a host node was using the system that just died, it may pause briefly before re-connecting to one that is still running via a backup-server setting or your dns configs). While the system is still going with one node down, that node is falling behind on new disk writes, and the remaining ones are keeping track of what’s changing. Once you repair/recover/reboot the down node, it will rejoin the cluster. Now the recovered system has to catch up, and it does this by having the other two nodes send it the changes. In the meantime, gluster serves any reads for that data from one of the up-to-date nodes, even if you ask the one you just restarted. In order to do this healing, it has to lock the files to ensure no changes are made while it copies a chunk of them over to the recovered node. When it locks them, your hypervisor notices they have gone read-only, and especially if it has a pending write for that file, it may pause the VM because this looks like a storage issue to it. Once the file gets unlocked, it can be written again, and your hypervisor notices and will generally reactivate your VM. You may see delays too, especially if you only have 1G networking between your host nodes while everything is getting copied around. And your files could be locked, updated, unlocked, then locked again a few seconds or minutes later, and so on.

That’s where sharding comes into play: once you have a file broken up into shards, gluster can get away with locking only the particular shard it needs to heal, leaving the rest of the disk image unlocked. You may still catch a brief pause if you try to write the specific segment of the file gluster is healing at that moment, but it’s also going to be much faster because it’s a small chunk of the file, and copies quickly.

Also, check out https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/server-quorum/; you probably want to set cluster.server-quorum-ratio to 50 for a replica-3 setup to avoid the possibility of split-brains. Your cluster will stop accepting writes if it loses two nodes, but you can always change the server-quorum-ratio later if you need to keep it running temporarily.

Hope that makes sense of what’s going on for you,

 -Darrell
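For reference, the pieces described above map to commands along these lines. This is a minimal sketch: the volume name gv0 is just a placeholder, and exact syntax can vary a bit between gluster versions.

gluster volume heal gv0 info                            # see what is still pending heal after a node comes back
gluster volume set gv0 features.shard on                # shard files so heals lock one shard at a time
gluster volume set gv0 features.shard-block-size 64MB   # the usual default shard size
gluster volume set gv0 cluster.server-quorum-type server
gluster volume set all cluster.server-quorum-ratio 50%  # pool-wide option, per the server-quorum doc linked above

Note that enabling sharding only affects files written after the option is set; existing disk images stay unsharded until they are copied back onto the volume.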
On Aug 23, 2019, at 5:06 PM, Carl Sirotic <csirotic@evoqarchitecture.com> wrote:
Okay,

so it means at least I am not getting the expected behavior, and there is hope.

I put in the quorum settings that I was told about a couple of emails ago. After applying the virt group, they are:

cluster.quorum-type          auto
cluster.quorum-count         (null)
cluster.server-quorum-type   server
cluster.server-quorum-ratio  0
cluster.quorum-reads         no

Also, I just put the ping timeout to 5 seconds now.

Carl
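(For reference, a quick way to confirm the values a volume is actually running with, assuming a volume named myvol, is:

gluster volume get myvol all | grep -E 'quorum|ping-timeout'

which lists the effective settings, including defaults that were never set explicitly.)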
<div class="moz-cite-prefix">On 2019-08-23 5:45 p.m., Ingo Fischer
wrote:<br class="">
</div>
<blockquote type="cite" cite="mid:WM!70be0689af39e9a9c31a03ebbb8526a85e48659c6f910db0e14f7159e759eff604955b78624896a3e47e88cb0eb836d0!@filter1.lastspam.com" class="">
<meta http-equiv="content-type" content="text/html; charset=UTF-8" class="">
Hi Carl,
<div class=""><br class="">
</div>
<div class="">In my understanding and experience (I have a replica 3 System
running too) this should not happen. Can you tell your client
and server quorum settings?<br class="">
<br class="">
<div dir="ltr" class="">Ingo</div>
<div dir="ltr" class=""><br class="">
Am 23.08.2019 um 15:53 schrieb Carl Sirotic <<a href="mailto:csirotic@evoqarchitecture.com" moz-do-not-send="true" class="">csirotic@evoqarchitecture.com</a>>:<br class="">
<br class="">
</div>
However,

I must have misunderstood the whole concept of gluster.

In a replica 3, for me, it's completely unacceptable, regardless of the options, that all my VMs go down when I reboot one node. The whole purpose of keeping a full 3 copies of my data on the fly is supposed to be exactly this.

I am in the process of sharding every file. But even if the healing time were longer, I would still expect a non-sharded replica 3 volume with VM boot disks not to go down when I reboot one of its copies.

I am not very impressed by gluster so far.

Carl
<div class="moz-cite-prefix">On 2019-08-19 4:15 p.m.,
Darrell Budic wrote:<br class="">
</div>
<blockquote type="cite" cite="mid:WM!70b2a24c324753176289e0b250d790e7f5ffa931f81e9072ffcc23c4f6fc1a7199617ab90bbd5e0e5170f02e1339ca54!@filter4.lastspam.com" class="">
<meta http-equiv="Content-Type" content="text/html;
charset=UTF-8" class="">
/var/lib/glusterd/groups/virt is a good start for ideas, notably some thread settings and choose-local=off to improve read performance. If you don’t have at least 10 cores on your servers, you may want to lower the recommended shd-max-threads=8 to no more than half your CPU cores to keep healing from swamping out regular work.
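As an illustration (the volume name myvol here is just a placeholder), you can inspect what the group would apply and dial the heal threads back with something like:

cat /var/lib/glusterd/groups/virt                     # shows the option/value pairs the virt group sets
gluster volume set myvol cluster.shd-max-threads 4    # e.g. half the cores on an 8-core server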
<div class=""><br class="">
</div>
<div class="">It’s also starting to depend on what your
backing store and networking setup are, so you’re going
to want to test changes and find what works best for
your setup.</div>
<div class=""><br class="">
</div>
<div class="">In addition to the virt group settings, I
use these on most of my volumes, SSD or HDD backed, with
the default 64M shard size:</div>
<div class=""><br class="">
</div>
<div class="">
<div style="margin: 0px; font-stretch: normal;
line-height: normal;" class=""><span style="font-variant-ligatures: no-common-ligatures;
background-color: rgba(255, 255, 255, 0);" class=""><a href="http://performance.io/" class="" moz-do-not-send="true">performance.io</a>-thread-count:
32<span class="Apple-tab-span" style="white-space:pre">                </span>#
seemed good for my system, particularly a ZFS backed
volume with lots of spindles</span></div>
<div style="margin: 0px; font-stretch: normal;
line-height: normal;" class=""><span style="font-variant-ligatures: no-common-ligatures;
background-color: rgba(255, 255, 255, 0);" class="">client.event-threads:
8<span class="Apple-tab-span" style="white-space:pre">                                </span></span></div>
</div>
<div class=""><span style="font-variant-ligatures:
no-common-ligatures" class="">
<div style="margin: 0px; font-stretch: normal;
line-height: normal;" class=""><span style="font-variant-ligatures:
no-common-ligatures; background-color: rgba(255,
255, 255, 0);" class="">cluster.data-self-heal-algorithm:
full<span class="Apple-tab-span" style="white-space:pre">        </span>#
10G networking, uses more net/less cpu to heal.
probably don’t use this for 1G networking?</span></div>
<div class=""><span style="font-variant-ligatures:
no-common-ligatures" class="">
<div style="margin: 0px; font-stretch: normal;
line-height: normal;" class=""><span style="font-variant-ligatures:
no-common-ligatures; background-color:
rgba(255, 255, 255, 0);" class="">performance.stat-prefetch:
on</span></div>
<div class=""><span style="font-variant-ligatures:
no-common-ligatures" class="">
<div style="margin: 0px; font-stretch: normal;
line-height: normal;" class=""><span style="font-variant-ligatures:
no-common-ligatures; background-color:
rgba(255, 255, 255, 0);" class="">cluster.read-hash-mode:
3<span class="Apple-tab-span" style="white-space:pre">                        </span>#
distribute reads to least loaded server
(by read queue depth)</span></div>
<div class=""><span style="font-variant-ligatures:
no-common-ligatures; background-color:
rgba(255, 255, 255, 0);" class=""><br class="">
</span></div>
<div class=""><span style="font-variant-ligatures:
no-common-ligatures; background-color:
rgba(255, 255, 255, 0);" class="">and
these two only on my HDD backed volume:</span></div>
<div class=""><span style="font-variant-ligatures:
no-common-ligatures; background-color:
rgba(255, 255, 255, 0);" class=""><br class="">
</span></div>
<div class=""><span style="font-variant-ligatures:
no-common-ligatures" class="">
<div style="margin: 0px; font-stretch:
normal; line-height: normal;" class=""><span style="font-variant-ligatures:
no-common-ligatures; background-color:
rgba(255, 255, 255, 0);" class="">performance.cache-size:
1G</span></div>
<div style="margin: 0px; font-stretch:
normal; line-height: normal;" class=""><span style="font-variant-ligatures:
no-common-ligatures; background-color:
rgba(255, 255, 255, 0);" class="">performance.write-behind-window-size:
64MB</span></div>
<div class=""><span style="font-variant-ligatures:
no-common-ligatures" class=""><br class="">
</span></div>
<div class=""><span style="font-variant-ligatures:
no-common-ligatures" class="">but I
suspect these two need another round
or six of tuning to tell if they are
making a difference.</span></div>
</span></div>
</span></div>
</span></div>
</span></div>
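Each of these is an ordinary per-volume option; applying one looks like the following, with myvol as a placeholder volume name:

gluster volume set myvol performance.io-thread-count 32
gluster volume set myvol cluster.read-hash-mode 3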
<div class=""><br class="">
</div>
<div class="">I use the throughput-performance tuned
profile on my servers, so you should be in good shape
there.</div>
<div class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 19, 2019, at 12:22 PM, Guy
Boisvert <<a href="mailto:guy.boisvert@ingtegration.com" class="" moz-do-not-send="true">guy.boisvert@ingtegration.com</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">On 2019-08-19 12:08 p.m., Darrell
Budic wrote:<br class="">
<blockquote type="cite" class="">You also need
to make sure your volume is setup properly for
best performance. Did you apply the gluster
virt group to your volumes, or at least
features.shard = on on your VM volume?<br class="">
</blockquote>
<br class="">
That's what we did here:<br class="">
<br class="">
<br class="">
gluster volume set W2K16_Rhenium
cluster.quorum-type auto<br class="">
gluster volume set W2K16_Rhenium
network.ping-timeout 10<br class="">
gluster volume set W2K16_Rhenium auth.allow \*<br class="">
gluster volume set W2K16_Rhenium group virt<br class="">
gluster volume set W2K16_Rhenium
storage.owner-uid 36<br class="">
gluster volume set W2K16_Rhenium
storage.owner-gid 36<br class="">
gluster volume set W2K16_Rhenium features.shard
on<br class="">
gluster volume set W2K16_Rhenium
features.shard-block-size 256MB<br class="">
gluster volume set W2K16_Rhenium
cluster.data-self-heal-algorithm full<br class="">
gluster volume set W2K16_Rhenium
performance.low-prio-threads 32<br class="">
<br class="">
tuned-adm profile random-io (a profile i
added in CentOS 7)<br class="">
<br class="">
<br class="">
cat /usr/lib/tuned/random-io/tuned.conf<br class="">
===========================================<br class="">
[main]<br class="">
summary=Optimize for Gluster virtual machine
storage<br class="">
include=throughput-performance<br class="">
<br class="">
[sysctl]<br class="">
<br class="">
vm.dirty_ratio = 5<br class="">
vm.dirty_background_ratio = 2<br class="">
<br class="">
<br class="">
Any more optimization to add to this?

Guy

-- 
Guy Boisvert, ing.
IngTegration inc.
http://www.ingtegration.com
https://www.linkedin.com/in/guy-boisvert-8990487
CONFIDENTIALITY NOTICE: Proprietary/Confidential Information belonging to IngTegration Inc. and its affiliates may be contained in this message. If you are not a recipient indicated or intended in this message (or responsible for delivery of this message to such person), or you think for any reason that this message may have been addressed to you in error, you may not use or copy or deliver this message to anyone else. In such case, you should destroy this message and are asked to notify the sender by reply email.
<blockquote type="cite" class="">
<div dir="ltr" class=""><span class="">_______________________________________________</span><br class="">
<span class="">Gluster-users mailing list</span><br class="">
<span class=""><a href="mailto:Gluster-users@gluster.org" moz-do-not-send="true" class="">Gluster-users@gluster.org</a></span><br class="">
<span class=""><a href="https://lists.gluster.org/mailman/listinfo/gluster-users" moz-do-not-send="true" class="">https://lists.gluster.org/mailman/listinfo/gluster-users</a></span></div>