<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 17/04/20 10:35 am, Amar Tumballi
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAKknVp2OVZTdmn+S7eQ7Rh3gQMs2tHN-FtfOt6_fVvJxOru9FQ@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">This thread has been one of the largest effort to
        stabilize the systems in recent times.
        <div><br>
        </div>
        <div>Thanks for your patience and all the retries you did, Erik!</div>
      </div>
    </blockquote>
    <tt>Thanks indeed! Once
      <a class="moz-txt-link-freetext" href="https://review.gluster.org/#/c/glusterfs/+/24316/">https://review.gluster.org/#/c/glusterfs/+/24316/</a> gets merged on
      master, I will backport it to the release branches.</tt><br>
    <blockquote type="cite"
cite="mid:CAKknVp2OVZTdmn+S7eQ7Rh3gQMs2tHN-FtfOt6_fVvJxOru9FQ@mail.gmail.com">
      <div dir="ltr">
        <div><br>
        </div>
        <div>We definitely need to get to the bottom of the glitch you
          found with the 7.4 version; with every newer version, we expect
          more stability!</div>
      </div>
    </blockquote>
    <p><tt>True, maybe we should start a separate thread...</tt></p>
    <tt>Regards,</tt><br>
    <tt>Ravi<br>
    </tt>
    <blockquote type="cite"
cite="mid:CAKknVp2OVZTdmn+S7eQ7Rh3gQMs2tHN-FtfOt6_fVvJxOru9FQ@mail.gmail.com">
      <div dir="ltr">
        <div>Regards,</div>
        <div>Amar</div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Fri, Apr 17, 2020 at 2:46
          AM Erik Jacobson &lt;<a href="mailto:erik.jacobson@hpe.com"
            target="_blank" moz-do-not-send="true">erik.jacobson@hpe.com</a>&gt;
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I
          have some news.<br>
          <br>
          After many, many trials, reboots of gluster servers,
          reboots of nodes...<br>
          in what should have reproduced the issue several times, I
          think we're<br>
          stable.<br>
          <br>
          It appears this glusterfs nfs daemon hang only happens in
          glusterfs74<br>
          and not glusterfs72.<br>
          <br>
          So....<br>
          1) Your split-brain patch<br>
          2) performance.parallel-readdir off<br>
          3) glusterfs72<br>
          <br>
          I declare it stable. I can't make it fail: no split-brain, hang,
          nor seg fault<br>
          with one leader down.<br>
          <br>
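          For the record, item 2 above is a single volume-set call; a rough
          sketch, using the cm_shared volume name from the quoted message
          below:<br>
          <br>
          # turn the option off, then confirm the value took effect<br>
          gluster volume set cm_shared performance.parallel-readdir off<br>
          gluster volume get cm_shared performance.parallel-readdir<br>
          <br>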
          I'm working on putting this into a SW update.<br>
          <br>
          We are going to test whether performance.parallel-readdir off
          impacts booting<br>
          at scale, but we don't have a system to try it on at this time.<br>
          <br>
          THANK YOU!<br>
          <br>
          I may have access to the 57-node test system if there is
          something you'd<br>
          like me to try regarding why glusterfs74 is unstable in
          this<br>
          situation. Just let me know.<br>
          <br>
          Erik<br>
          <br>
          On Thu, Apr 16, 2020 at 12:03:33PM -0500, Erik Jacobson wrote:<br>
          &gt; So in my test runs since making that change, we have a
          different odd<br>
          &gt; behavior now. As you recall, this is with your patch --
          still no<br>
          &gt; split-brain -- and now with performance.parallel-readdir
          off.<br>
          &gt; <br>
          &gt; The NFS server grinds to a halt after a few test runs.
          It does not core<br>
          &gt; dump.<br>
          &gt; <br>
          &gt; All that shows up in the log is:<br>
          &gt; <br>
          &gt; "pending frames:" with nothing after it and no date
          stamp.<br>
          &gt; <br>
          &gt; I will start looking for interesting breakpoints, I
          guess.<br>
          &gt; <br>
          &gt; <br>
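          &gt; If it wedges again, a rough sketch of what I might try first
          (the pid<br>
          &gt; file path is from the ps output below; the statedump should
          land under<br>
          &gt; /var/run/gluster by default):<br>
          &gt; <br>
          &gt; # dump a backtrace of every thread in the hung gnfs process<br>
          &gt; gdb -batch -ex 'thread apply all bt' -p $(cat
          /var/run/gluster/nfs/nfs.pid)<br>
          &gt; <br>
          &gt; # or ask glusterfs itself for a statedump<br>
          &gt; kill -USR1 $(cat /var/run/gluster/nfs/nfs.pid)<br>
          &gt; <br>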
          &gt; The glusterfs for nfs is still alive:<br>
          &gt; <br>
          &gt; root     30541     1 42 09:57 ?        00:51:06
          /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p
          /var/run/gluster/nfs/nfs.pid -l /var/log/glusterfs/nfs.log -S
          /var/run/gluster/9ddb5561058ff543.socket<br>
          &gt; <br>
          &gt; <br>
          &gt; <br>
          &gt; [root@leader3 ~]# strace -f  -p 30541<br>
          &gt; strace: Process 30541 attached with 40 threads<br>
          &gt; [pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30579] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30578] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30577] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30576] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30575] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30574] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30573] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30572] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30571] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30570] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30569] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30568] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30567] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30566] futex(0x7f88b8000020, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30565] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30564] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30563] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30562] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30561] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30560] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30559] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30558] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30557] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30556] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30555] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30554] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30553] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30552] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30551] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30550] restart_syscall(&lt;... resuming interrupted
          restart_syscall ...&gt; &lt;unfinished ...&gt;<br>
          &gt; [pid 30549] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30548] futex(0x7f88b8000020, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=0,
          tv_usec=243775} &lt;unfinished ...&gt;<br>
          &gt; [pid 30546] restart_syscall(&lt;... resuming interrupted
          restart_syscall ...&gt; &lt;unfinished ...&gt;<br>
          &gt; [pid 30545] restart_syscall(&lt;... resuming interrupted
          restart_syscall ...&gt; &lt;unfinished ...&gt;<br>
          &gt; [pid 30544] futex(0x7f88b8000020, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30543] rt_sigtimedwait([HUP INT USR1 USR2 TERM], 
          &lt;unfinished ...&gt;<br>
          &gt; [pid 30542] futex(0x7f88b8000020, FUTEX_WAIT_PRIVATE, 2,
          NULL &lt;unfinished ...&gt;<br>
          &gt; [pid 30541] futex(0x7f890c3a39d0, FUTEX_WAIT, 30548, NULL
          &lt;unfinished ...&gt;<br>
          &gt; [pid 30547] &lt;... select resumed&gt; )      = 0
          (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}) = 0 (Timeout)<br>
          &gt; [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1,
          tv_usec=0}^Cstrace: Process 30541 detached<br>
          &gt; strace: Process 30542 detached<br>
          &gt; strace: Process 30543 detached<br>
          &gt; strace: Process 30544 detached<br>
          &gt; strace: Process 30545 detached<br>
          &gt; strace: Process 30546 detached<br>
          &gt; strace: Process 30547 detached<br>
          &gt;  &lt;detached ...&gt;<br>
          &gt; strace: Process 30548 detached<br>
          &gt; strace: Process 30549 detached<br>
          &gt; strace: Process 30550 detached<br>
          &gt; strace: Process 30551 detached<br>
          &gt; strace: Process 30552 detached<br>
          &gt; strace: Process 30553 detached<br>
          &gt; strace: Process 30554 detached<br>
          &gt; strace: Process 30555 detached<br>
          &gt; strace: Process 30556 detached<br>
          &gt; strace: Process 30557 detached<br>
          &gt; strace: Process 30558 detached<br>
          &gt; strace: Process 30559 detached<br>
          &gt; strace: Process 30560 detached<br>
          &gt; strace: Process 30561 detached<br>
          &gt; strace: Process 30562 detached<br>
          &gt; strace: Process 30563 detached<br>
          &gt; strace: Process 30564 detached<br>
          &gt; strace: Process 30565 detached<br>
          &gt; strace: Process 30566 detached<br>
          &gt; strace: Process 30567 detached<br>
          &gt; strace: Process 30568 detached<br>
          &gt; strace: Process 30569 detached<br>
          &gt; strace: Process 30570 detached<br>
          &gt; strace: Process 30571 detached<br>
          &gt; strace: Process 30572 detached<br>
          &gt; strace: Process 30573 detached<br>
          &gt; strace: Process 30574 detached<br>
          &gt; strace: Process 30575 detached<br>
          &gt; strace: Process 30576 detached<br>
          &gt; strace: Process 30577 detached<br>
          &gt; strace: Process 30578 detached<br>
          &gt; strace: Process 30579 detached<br>
          &gt; strace: Process 30580 detached<br>
          &gt; <br>
          &gt; <br>
          &gt; <br>
          &gt; <br>
          &gt; &gt; On 16/04/20 8:04 pm, Erik Jacobson wrote:<br>
          &gt; &gt; &gt; Quick update just on how this got set.<br>
          &gt; &gt; &gt; <br>
          &gt; &gt; &gt; gluster volume set cm_shared
          performance.parallel-readdir on<br>
          &gt; &gt; &gt; <br>
          &gt; &gt; &gt; is something we turned on, thinking it might
          make our NFS services<br>
          &gt; &gt; &gt; faster and not knowing it could have a negative
          effect.<br>
          &gt; &gt; &gt; <br>
          &gt; &gt; &gt; Below is a diff of the nfs volume file ON vs
          OFF. So I will simply turn<br>
          &gt; &gt; &gt; this OFF and do a test run.<br>
          &gt; &gt; Yes, that should do it. I am not sure if
          performance.parallel-readdir was<br>
          &gt; &gt; intentionally made to have an effect on gnfs
          volfiles. Usually, for other<br>
          &gt; &gt; performance xlators, `gluster volume set` only
          changes the fuse volfile.<br>
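          &gt; &gt; If you want to double-check after flipping it, counting
          the readdir-ahead<br>
          &gt; &gt; instances in the generated volfiles should show the
          difference; roughly<br>
          &gt; &gt; (paths assume a default install and your cm_shared
          volume):<br>
          &gt; &gt; <br>
          &gt; &gt; # parallel-readdir loads one readdir-ahead xlator per subvolume<br>
          &gt; &gt; grep -c readdir-ahead /var/lib/glusterd/nfs/nfs-server.vol \<br>
          &gt; &gt;   /var/lib/glusterd/vols/cm_shared/cm_shared.tcp-fuse.vol<br>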
        </blockquote>
      </div>
      <br clear="all">
      <div><br>
      </div>
      -- <br>
      <div dir="ltr">
        <div dir="ltr">--
          <div><a href="https://kadalu.io" target="_blank"
              moz-do-not-send="true">https://kadalu.io</a></div>
          <div>Container Storage made easy!</div>
          <div><br>
          </div>
        </div>
      </div>
    </blockquote>
  </body>
</html>