<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 12/12/2019 4:34 AM, Ravishankar N

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:69a67e38-851e-7cf4-7a72-4ac802629a39@redhat.com">

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      <p><br>

      </p>

      <div class="moz-cite-prefix">On 12/12/19 4:01 am, WK wrote:<br>

      </div>

      <blockquote type="cite"

        cite="mid:77d643fc-ad9e-d5d2-6815-65512609c2ea@bneit.com">

        <meta http-equiv="content-type" content="text/html;

          charset=UTF-8">

        <p>&lt;BUMP&gt; so I can get some sort of resolution on the
          issue (i.e. is it hardware, Gluster, etc.)<br>

        </p>

        <p>I guess what I really need to know is <br>

        </p>

        <p>1) Node 2 complains that it can't reach node 1 and node 3. If
          this was an OS/hardware networking issue and not internal to
          Gluster, then why didn't node 1 and node 3 have error messages
          complaining about not reaching node 2?<br>

        </p>

      </blockquote>

      <tt>[2019-12-05 22:00:43.739804] C

        [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]

        0-GL1image-client-2: server 10.255.1.1:49153 has not responded

        in the last 21 seconds, disconnecting.</tt><tt><br>

      </tt><tt> [2019-12-05 22:00:43.757095] C

        [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]

        0-GL1image-client-1: server 10.255.1.3:49152 has not responded

        in the last 21 seconds, disconnecting.</tt><tt><br>

      </tt><tt> [2019-12-05 22:00:43.757191] I [MSGID: 114018]

        [client.c:2323:client_rpc_notify] 0-GL1image-client-2:

        disconnected from GL1image-client-2. Client process will keep

        trying to connect to glusterd until brick's port is available</tt><tt><br>

      </tt><tt> [2019-12-05 22:00:43.757246] I [MSGID: 114018]

        [client.c:2323:client_rpc_notify] 0-GL1image-client-1:

        disconnected from GL1image-client-1. Client process will keep

        trying to connect to glusterd until brick's port is available</tt><tt><br>

      </tt>

      <p><tt> [2019-12-05 22:00:43.757266] W [MSGID: 108001]

          [afr-common.c:5608:afr_notify] 0-GL1image-replicate-0:

          Client-quorum is not met</tt></p>

      <p>This seems to indicate the mount on node 2 cannot reach 2

        bricks. If quorum is not met, you will get ENOTCONN on the

        mount. Maybe check if the mount is still disconnected from the

        bricks (either statedump or looking at the .meta folder)?<br>

      </p>

    </blockquote>

    <p>OK, it is a localhost FUSE mount, if that information helps.
      Should we be mounting on the actual IP of the Gluster network
      instead?<br>

    </p>

    <p>The Gluster setup on Node 2 returned to normal 11 seconds later,
      when the mount reconnected, and everything was fine by the time we
      were finally notified of a problem and investigated (the VM lockup
      had already occurred).<br>

    </p>
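    <p>As a rough check of that 11-second figure, here is a minimal
      Python sketch that subtracts the two client-quorum timestamps from
      the node 2 log quoted in this thread. The helper names
      (<tt>parse_ts</tt>, <tt>seconds_between</tt>) are made up for
      illustration, and the timestamp format is assumed from the log
      lines themselves.</p>

```python
from datetime import datetime

# Timestamps copied from the node 2 client log quoted in this thread.
QUORUM_LOST = "2019-12-05 22:00:43.757266"  # MSGID 108001: Client-quorum is not met
QUORUM_BACK = "2019-12-05 22:00:54.821406"  # MSGID 108002: Client-quorum is met


def parse_ts(text: str) -> datetime:
    """Parse a log timestamp of the form seen above (microsecond precision)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds from start to end."""
    return (parse_ts(end) - parse_ts(start)).total_seconds()


outage = seconds_between(QUORUM_LOST, QUORUM_BACK)
print(f"client-quorum was lost for {outage:.3f} s")
```

    <p>This only measures the window between the two quorum messages; it
      says nothing about how long the bricks had already been
      unresponsive before the ping timer expired.</p>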

    <p>I'm not sure about the client port change from 0 back to 49153.
      Is that a clue? Where did port 0 come from?<br>

    </p>

    <p>So is this an OS/Fuse problem with just the Node 2 mount locally

      becoming "confused" and then recovering?<br>

    </p>

    <p>Again, Node 1 and Node 3 were happily reaching Node 2 while this
      was occurring. From their perspective, they never lost their
      connection to Node 2.<br>

    </p>

    <p><br>

    </p>

    <p>[2019-12-05 22:00:54.807833] I

      [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-GL1image-client-2: changing

      port to 49153 (from 0)<br>

      [2019-12-05 22:00:54.808043] I [rpc-clnt.c:2028:rpc_clnt_reconfig]

      0-GL1image-client-1: changing port to 49152 (from 0)<br>

      [2019-12-05 22:00:54.820394] I [MSGID: 114046]

      [client-handshake.c:1106:client_setvolume_cbk]

      0-GL1image-client-1: Connected to GL1image-client-1, attached to

      remote volume '/GLUSTER/GL1image'.<br>

      [2019-12-05 22:00:54.820447] I [MSGID: 114042]

      [client-handshake.c:930:client_post_handshake]

      0-GL1image-client-1: 10 fds open - Delaying child_up until they

      are re-opened<br>

      [2019-12-05 22:00:54.820549] I [MSGID: 114046]

      [client-handshake.c:1106:client_setvolume_cbk]

      0-GL1image-client-2: Connected to GL1image-client-2, attached to

      remote volume '/GLUSTER/GL1image'.<br>

      [2019-12-05 22:00:54.820568] I [MSGID: 114042]

      [client-handshake.c:930:client_post_handshake]

      0-GL1image-client-2: 10 fds open - Delaying child_up until they

      are re-opened<br>

      [2019-12-05 22:00:54.821381] I [MSGID: 114041]

      [client-handshake.c:318:client_child_up_reopen_done]

      0-GL1image-client-1: last fd open'd/lock-self-heal'd - notifying

      CHILD-UP<br>

      [2019-12-05 22:00:54.821406] I [MSGID: 108002]

      [afr-common.c:5602:afr_notify] 0-GL1image-replicate-0:

      Client-quorum is met<br>

      [2019-12-05 22:00:54.821446] I [MSGID: 114041]

      [client-handshake.c:318:client_child_up_reopen_done]

      0-GL1image-client-2: last fd open'd/lock-self-heal'd - notifying

      CHILD-UP</p>

    <p><br>

    </p>

    <p>In the meantime, we raised the timeout back to the default of 42
      seconds, which would have prevented the VM freeze. I suspect there
      is a reason that is the default.<br>

    </p>
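    <p>To make that reasoning concrete, here is a hedged Python sketch
      of the arithmetic, using only timestamps from the node 2 log
      quoted in this thread. It assumes the stall began at the earliest
      "called at" time among the unwound frames and ended when
      client-quorum was met again, and it assumes the bricks would have
      answered by then had the connection not been dropped. The function
      names are hypothetical, and the comparison is a simplification of
      how the ping timer really works.</p>

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the node 2 client log quoted in this thread.
FIRST_STALLED_CALL = "2019-12-05 22:00:19.736428"  # earliest "called at" of an unwound frame
RECONNECTED = "2019-12-05 22:00:54.821406"         # Client-quorum is met


def stall_seconds(first_call: str, reconnected: str) -> float:
    """Elapsed seconds between the first stalled call and the reconnect."""
    delta = datetime.strptime(reconnected, FMT) - datetime.strptime(first_call, FMT)
    return delta.total_seconds()


def timer_would_fire(stall: float, ping_timeout: float) -> bool:
    """Simplified model: the timer fires once the stall reaches the timeout."""
    return stall >= ping_timeout


stall = stall_seconds(FIRST_STALLED_CALL, RECONNECTED)
print(f"stall of about {stall:.1f} s")
for timeout in (21.0, 42.0):
    fires = timer_would_fire(stall, timeout)
    print(f"ping-timeout {timeout:.0f} s: timer fires = {fires}")
```

    <p>Under these assumptions the stall is longer than the 21-second
      setting but shorter than the 42-second default, which is
      consistent with the default riding out this particular hiccup.</p>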

    <p>-wk<br>

    </p>


    <br>

    <blockquote type="cite"

      cite="mid:69a67e38-851e-7cf4-7a72-4ac802629a39@redhat.com">

      <blockquote type="cite"

        cite="mid:77d643fc-ad9e-d5d2-6815-65512609c2ea@bneit.com">

        <div class="moz-forward-container"> -------- Forwarded Message

          --------

          <table class="moz-email-headers-table" cellspacing="0"

            cellpadding="0" border="0">

            <tbody>

              <tr>

                <th valign="BASELINE" nowrap="nowrap" align="RIGHT">Subject:

                </th>

                <td>VM freeze issue on simple gluster setup.</td>

              </tr>

              <tr>

                <th valign="BASELINE" nowrap="nowrap" align="RIGHT">Date:

                </th>

                <td>Thu, 5 Dec 2019 16:23:35 -0800</td>

              </tr>

              <tr>

                <th valign="BASELINE" nowrap="nowrap" align="RIGHT">From:

                </th>

                <td>WK <a class="moz-txt-link-rfc2396E"

                    href="mailto:wkmail@bneit.com"

                    moz-do-not-send="true">&lt;wkmail@bneit.com&gt;</a></td>

              </tr>

              <tr>

                <th valign="BASELINE" nowrap="nowrap" align="RIGHT">To:

                </th>

                <td>Gluster Users <a class="moz-txt-link-rfc2396E"

                    href="mailto:gluster-users@gluster.org"

                    moz-do-not-send="true">&lt;gluster-users@gluster.org&gt;</a></td>

              </tr>

            </tbody>

          </table>

          <br>

          <br>

          I have a replica2+arbiter setup that is used for VMs.<br>

          <br>

          ip #.1 is the arb<br>

          <br>

          ip #.2 and #.3 are the kvm hosts.<br>

          <br>

          Two volumes are involved, and it's Gluster 6.5/Ubuntu 18.4/fuse.
          The Gluster networking uses a two-Ethernet-card
          teamd/round-robin setup which *should* have stayed up if one
          of the ports had failed.<br>

          <br>

          I just had a number of VMs go Read-Only due to the below

          communication failure at 22:00 but only on kvm host  #2<br>

          <br>

          VMs on the same gluster volumes on kvm host 3 were unaffected.<br>

          <br>

          The logs on host #2 show the following:<br>

          <br>

          [2019-12-05 22:00:43.739804] C

          [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]

          0-GL1image-client-2: server 10.255.1.1:49153 has not responded

          in the last 21 seconds, disconnecting.<br>

          [2019-12-05 22:00:43.757095] C

          [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]

          0-GL1image-client-1: server 10.255.1.3:49152 has not responded

          in the last 21 seconds, disconnecting.<br>

          [2019-12-05 22:00:43.757191] I [MSGID: 114018]

          [client.c:2323:client_rpc_notify] 0-GL1image-client-2:

          disconnected from GL1image-client-2. Client process will keep

          trying to connect to glusterd until brick's port is available<br>

          [2019-12-05 22:00:43.757246] I [MSGID: 114018]

          [client.c:2323:client_rpc_notify] 0-GL1image-client-1:

          disconnected from GL1image-client-1. Client process will keep

          trying to connect to glusterd until brick's port is available<br>

          [2019-12-05 22:00:43.757266] W [MSGID: 108001]

          [afr-common.c:5608:afr_notify] 0-GL1image-replicate-0:

          Client-quorum is not met<br>

          [2019-12-05 22:00:43.790639] E

          [rpc-clnt.c:346:saved_frames_unwind] (--&gt;

/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59]

          (--&gt;

          /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0]

          (--&gt;

          /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce]

          (--&gt;

/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45]

          (--&gt;

          /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890]

          ))))) 0-GL1image-client-2: forced unwinding frame

          type(GlusterFS 4.x v1) op(FXATTROP(34)) called at 2019-12-05

          22:00:19.736456 (xid=0x825bffb)<br>

          [2019-12-05 22:00:43.790655] W [MSGID: 114031]

          [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk]

          0-GL1image-client-2: remote operation failed<br>

          [2019-12-05 22:00:43.790686] E

          [rpc-clnt.c:346:saved_frames_unwind] (--&gt;

/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59]

          (--&gt;

          /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0]

          (--&gt;

          /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce]

          (--&gt;

/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45]

          (--&gt;

          /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890]

          ))))) 0-GL1image-client-1: forced unwinding frame

          type(GlusterFS 4.x v1) op(FXATTROP(34)) called at 2019-12-05

          22:00:19.736428 (xid=0x89fee01)<br>

          [2019-12-05 22:00:43.790703] W [MSGID: 114031]

          [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk]

          0-GL1image-client-1: remote operation failed<br>

          [2019-12-05 22:00:43.790774] E [MSGID: 114031]

          [client-rpc-fops_v2.c:1393:client4_0_finodelk_cbk]

          0-GL1image-client-1: remote operation failed [Transport

          endpoint is not connected]<br>

          [2019-12-05 22:00:43.790777] E

          [rpc-clnt.c:346:saved_frames_unwind] (--&gt;

/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59]

          (--&gt;

          /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0]

          (--&gt;

          /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce]

          (--&gt;

/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45]

          (--&gt;

          /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890]

          ))))) 0-GL1image-client-2: forced unwinding frame

          type(GlusterFS 4.x v1) op(FXATTROP(34)) called at 2019-12-05

          22:00:19.736542 (xid=0x825bffc)<br>

          [2019-12-05 22:00:43.790794] W [MSGID: 114029]

          [client-rpc-fops_v2.c:4873:client4_0_finodelk]

          0-GL1image-client-1: failed to send the fop<br>

          [2019-12-05 22:00:43.790806] W [MSGID: 114031]

          [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk]

          0-GL1image-client-2: remote operation failed<br>

          [2019-12-05 22:00:43.790825] E [MSGID: 114031]

          [client-rpc-fops_v2.c:1393:client4_0_finodelk_cbk]

          0-GL1image-client-2: remote operation failed [Transport

          endpoint is not connected]<br>

          [2019-12-05 22:00:43.790842] W [MSGID: 114029]

          [client-rpc-fops_v2.c:4873:client4_0_finodelk]

          0-GL1image-client-2: failed to send the fop<br>

          <br>

          The fop/transport-not-connected errors just repeat for another
          50 lines or so until 22:00:46, at which point the
          volumes appear to be fine (though the VMs were still read-only
          until I rebooted).<br>

          <br>

          [2019-12-05 22:00:46.987242] W

          [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse:

          91701328: READ =&gt; -1

          gfid=d883b7c4-97f5-4f12-9373-7987cfc7dee4 fd=0x7f02f005b708

          (Transport endpoint is not connected)<br>

          [2019-12-05 22:00:47.029947] W

          [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse:

          91701329: READ =&gt; -1

          gfid=d883b7c4-97f5-4f12-9373-7987cfc7dee4 fd=0x7f02f005b708

          (Transport endpoint is not connected)<br>

          [2019-12-05 22:00:49.901075] W

          [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse:

          91701330: READ =&gt; -1

          gfid=c342dba6-a2a2-49a8-be3f-cd320e90c956 fd=0x7f02f002bee8

          (Transport endpoint is not connected)<br>

          [2019-12-05 22:00:49.923525] W

          [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse:

          91701331: READ =&gt; -1

          gfid=c342dba6-a2a2-49a8-be3f-cd320e90c956 fd=0x7f02f002bee8

          (Transport endpoint is not connected)<br>

          [2019-12-05 22:00:49.970219] W

          [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse:

          91701332: READ =&gt; -1

          gfid=fcec6b7a-ad23-4449-aa09-107e113877a1 fd=0x7f02f008dd58

          (Transport endpoint is not connected)<br>

          [2019-12-05 22:00:50.023932] W

          [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse:

          91701333: READ =&gt; -1

          gfid=fcec6b7a-ad23-4449-aa09-107e113877a1 fd=0x7f02f008dd58

          (Transport endpoint is not connected)<br>

          [2019-12-05 22:00:54.807833] I

          [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-GL1image-client-2:

          changing port to 49153 (from 0)<br>

          [2019-12-05 22:00:54.808043] I

          [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-GL1image-client-1:

          changing port to 49152 (from 0)<br>

          [2019-12-05 22:00:46.115076] E [MSGID: 133014]

          [shard.c:1799:shard_common_stat_cbk] 0-GL1image-shard: stat

          failed: 7a5959d6-75fc-411d-8831-57a744776ed3 [Transport

          endpoint is not connected]<br>

          [2019-12-05 22:00:54.820394] I [MSGID: 114046]

          [client-handshake.c:1106:client_setvolume_cbk]

          0-GL1image-client-1: Connected to GL1image-client-1, attached

          to remote volume '/GLUSTER/GL1image'.<br>

          [2019-12-05 22:00:54.820447] I [MSGID: 114042]

          [client-handshake.c:930:client_post_handshake]

          0-GL1image-client-1: 10 fds open - Delaying child_up until

          they are re-opened<br>

          [2019-12-05 22:00:54.820549] I [MSGID: 114046]

          [client-handshake.c:1106:client_setvolume_cbk]

          0-GL1image-client-2: Connected to GL1image-client-2, attached

          to remote volume '/GLUSTER/GL1image'.<br>

          [2019-12-05 22:00:54.820568] I [MSGID: 114042]

          [client-handshake.c:930:client_post_handshake]

          0-GL1image-client-2: 10 fds open - Delaying child_up until

          they are re-opened<br>

          [2019-12-05 22:00:54.821381] I [MSGID: 114041]

          [client-handshake.c:318:client_child_up_reopen_done]

          0-GL1image-client-1: last fd open'd/lock-self-heal'd -

          notifying CHILD-UP<br>

          [2019-12-05 22:00:54.821406] I [MSGID: 108002]

          [afr-common.c:5602:afr_notify] 0-GL1image-replicate-0:

          Client-quorum is met<br>

          [2019-12-05 22:00:54.821446] I [MSGID: 114041]

          [client-handshake.c:318:client_child_up_reopen_done]

          0-GL1image-client-2: last fd open'd/lock-self-heal'd -

          notifying CHILD-UP<br>

          <br>

          What is odd is that the gluster logs on #3 and #1 show
          absolutely ZERO gluster errors around that time, nor do I see
          any network/teamd errors on any of the 3 nodes (including the
          problem node #2).<br>

          <br>

          I've checked dmesg/syslog and every other log file on the box.<br>

          <br>

          According to a staff member, this same kvm host had
          the same problem about 3 weeks ago. It was written up as a
          fluke, possibly due to excess disk I/O, since we have been
          using gluster for years and have rarely seen issues,
          especially with very basic gluster usage.<br>

          <br>

          In this case those VMs weren't overly busy and now we have a

          repeat problem.<br>

          <br>

          So I am wondering where else I can look to diagnose the

          problem or should I abandon the hardware/setup?<br>

          <br>

          I assume it's a networking issue and not a gluster one, but I am
          confused why gluster nodes #1 and #3 didn't complain about not
          seeing #2. If the networking did drop out, shouldn't they have
          noticed?<br>

          <br>

          There also don't appear to be any visible hard disk issues
          (smartd is running).<br>

          <br>

          Side Note: I have reset the tcp-timeout back to 42 seconds and

          will look at upgrading to 6.6. I also see that the ARB and the

          unaffected Gluster node were running Gluster 6.4 (I don't know

          why #2 is on 6.5 but I am checking on that as well, we turn

          off auto-upgrade)<br>

          <br>

          Maybe the mismatched versions are the culprit?<br>

          <br>

          Also, we have a large number of these replica 2+1 gluster setups
          running gluster versions from 5.x up, and none of the others
          have had this issue.<br>

          <br>

          Any advice would be appreciated.<br>

          <br>

          Sincerely,<br>

          <br>

          Wk<br>

          <br>

          <br>

          <br>

          <br>

          <br>

        </div>

        <br>

        <fieldset class="mimeAttachmentHeader"></fieldset>

        <pre class="moz-quote-pre" wrap="">________

Community Meeting Calendar:

APAC Schedule -

Every 2nd and 4th Tuesday at 11:30 AM IST

Bridge: <a class="moz-txt-link-freetext" href="https://bluejeans.com/441850968" moz-do-not-send="true">https://bluejeans.com/441850968</a>

NA/EMEA Schedule -

Every 1st and 3rd Tuesday at 01:00 PM EDT

Bridge: <a class="moz-txt-link-freetext" href="https://bluejeans.com/441850968" moz-do-not-send="true">https://bluejeans.com/441850968</a>

Gluster-users mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Gluster-users@gluster.org" moz-do-not-send="true">Gluster-users@gluster.org</a>

<a class="moz-txt-link-freetext" href="https://lists.gluster.org/mailman/listinfo/gluster-users" moz-do-not-send="true">https://lists.gluster.org/mailman/listinfo/gluster-users</a>

</pre>

      </blockquote>

    </blockquote>

  </body>

</html>