<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 12/12/19 4:01 am, WK wrote:<br>
</div>
<blockquote type="cite"
cite="mid:77d643fc-ad9e-d5d2-6815-65512609c2ea@bneit.com">
<p>&lt;BUMP&gt; so I can get some sort of resolution on the issue
(i.e., is it hardware, Gluster, etc.)<br>
</p>
<p>I guess what I really need to know is:<br>
</p>
<p>1) Node 2 complains that it can't reach node 1 and node 3. If
this were an OS/hardware networking issue and not internal to
Gluster, then why didn't node 1 and node 3 log error messages
complaining about not reaching node 2?<br>
</p>
</blockquote>
<tt>[2019-12-05 22:00:43.739804] C
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]
0-GL1image-client-2: server 10.255.1.1:49153 has not responded in
the last 21 seconds, disconnecting.</tt><tt><br>
</tt><tt> [2019-12-05 22:00:43.757095] C
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]
0-GL1image-client-1: server 10.255.1.3:49152 has not responded in
the last 21 seconds, disconnecting.</tt><tt><br>
</tt><tt> [2019-12-05 22:00:43.757191] I [MSGID: 114018]
[client.c:2323:client_rpc_notify] 0-GL1image-client-2:
disconnected from GL1image-client-2. Client process will keep
trying to connect to glusterd until brick's port is available</tt><tt><br>
</tt><tt> [2019-12-05 22:00:43.757246] I [MSGID: 114018]
[client.c:2323:client_rpc_notify] 0-GL1image-client-1:
disconnected from GL1image-client-1. Client process will keep
trying to connect to glusterd until brick's port is available</tt><tt><br>
</tt>
<p><tt> [2019-12-05 22:00:43.757266] W [MSGID: 108001]
[afr-common.c:5608:afr_notify] 0-GL1image-replicate-0:
Client-quorum is not met</tt></p>
<p>This seems to indicate that the mount on node 2 could not reach two
of the bricks. If client-quorum is not met, operations on the mount
fail with ENOTCONN. It may be worth checking whether the mount is
still disconnected from the bricks, either by taking a statedump or
by looking at the .meta folder on the mount.<br>
</p>
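<p>A minimal sketch of the .meta check, in Python. It assumes the FUSE
mount exposes one directory per client translator under
.meta/graphs/active/, each with a "private" file containing a
"connected = 1" or "connected = 0" line. That layout, the helper names
and the example mount path are assumptions for illustration, so verify
them against your own mount first.</p>
<pre>
```python
from pathlib import Path


def parse_connected(private_text):
    """Return True/False from a 'connected = N' line, or None if absent.

    The 'key = value' format is an assumption about the 'private' file.
    """
    for line in private_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "connected":
            return value.strip() == "1"
    return None


def brick_connections(mount_point):
    """Map each client translator directory name to its connected state.

    Reads every */private file below the (assumed) .meta/graphs/active
    directory of the given FUSE mount point.
    """
    active = Path(mount_point) / ".meta" / "graphs" / "active"
    states = {}
    for private in sorted(active.glob("*-client-*/private")):
        states[private.parent.name] = parse_connected(private.read_text())
    return states
```
</pre>
<p>For example, brick_connections("/mnt/GL1image") (a hypothetical mount
path) would return one entry per GL1image-client-N translator; any
value other than True suggests the mount still cannot reach that
brick.</p>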
<blockquote type="cite"
cite="mid:77d643fc-ad9e-d5d2-6815-65512609c2ea@bneit.com">
<p> </p>
<div class="moz-forward-container">2) How significant is it that
node 2 was running 6.5 while node 1 and node 3 were running
6.4?</div>
</blockquote>
<p>A minor version mismatch should be fine, but it is always a good
idea to have all nodes on the same version.</p>
HTH,<br>
Ravi<br>
<blockquote type="cite"
cite="mid:77d643fc-ad9e-d5d2-6815-65512609c2ea@bneit.com">
<div class="moz-forward-container"><br>
</div>
<div class="moz-forward-container">-wk<br>
</div>
<div class="moz-forward-container"><br>
-------- Forwarded Message --------
<table class="moz-email-headers-table" cellspacing="0"
cellpadding="0" border="0">
<tbody>
<tr>
<th valign="BASELINE" nowrap="nowrap" align="RIGHT">Subject:
</th>
<td>VM freeze issue on simple gluster setup.</td>
</tr>
<tr>
<th valign="BASELINE" nowrap="nowrap" align="RIGHT">Date:
</th>
<td>Thu, 5 Dec 2019 16:23:35 -0800</td>
</tr>
<tr>
<th valign="BASELINE" nowrap="nowrap" align="RIGHT">From:
</th>
<td>WK <a class="moz-txt-link-rfc2396E"
href="mailto:wkmail@bneit.com" moz-do-not-send="true">&lt;wkmail@bneit.com&gt;</a></td>
</tr>
<tr>
<th valign="BASELINE" nowrap="nowrap" align="RIGHT">To: </th>
<td>Gluster Users <a class="moz-txt-link-rfc2396E"
href="mailto:gluster-users@gluster.org"
moz-do-not-send="true">&lt;gluster-users@gluster.org&gt;</a></td>
</tr>
</tbody>
</table>
<br>
<br>
I have a replica2+arbiter setup that is used for VMs.<br>
<br>
ip #.1 is the arb<br>
<br>
ip #.2 and #.3 are the kvm hosts.<br>
<br>
Two volumes are involved, and it's Gluster 6.5/Ubuntu 18.04/FUSE.
The Gluster networking uses a two-Ethernet-card
teamd/round-robin setup, which *should* have stayed up if one of
the ports had failed.<br>
<br>
I just had a number of VMs go read-only due to the communication
failure below at 22:00, but only on KVM host #2.<br>
<br>
VMs on the same gluster volumes on kvm host 3 were unaffected.<br>
<br>
The logs on host #2 show the following:<br>
<br>
[2019-12-05 22:00:43.739804] C
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]
0-GL1image-client-2: server 10.255.1.1:49153 has not responded
in the last 21 seconds, disconnecting.<br>
[2019-12-05 22:00:43.757095] C
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]
0-GL1image-client-1: server 10.255.1.3:49152 has not responded
in the last 21 seconds, disconnecting.<br>
[2019-12-05 22:00:43.757191] I [MSGID: 114018]
[client.c:2323:client_rpc_notify] 0-GL1image-client-2:
disconnected from GL1image-client-2. Client process will keep
trying to connect to glusterd until brick's port is available<br>
[2019-12-05 22:00:43.757246] I [MSGID: 114018]
[client.c:2323:client_rpc_notify] 0-GL1image-client-1:
disconnected from GL1image-client-1. Client process will keep
trying to connect to glusterd until brick's port is available<br>
[2019-12-05 22:00:43.757266] W [MSGID: 108001]
[afr-common.c:5608:afr_notify] 0-GL1image-replicate-0:
Client-quorum is not met<br>
[2019-12-05 22:00:43.790639] E
[rpc-clnt.c:346:saved_frames_unwind] (-->
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890]
))))) 0-GL1image-client-2: forced unwinding frame type(GlusterFS
4.x v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736456
(xid=0x825bffb)<br>
[2019-12-05 22:00:43.790655] W [MSGID: 114031]
[client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk]
0-GL1image-client-2: remote operation failed<br>
[2019-12-05 22:00:43.790686] E
[rpc-clnt.c:346:saved_frames_unwind] (-->
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890]
))))) 0-GL1image-client-1: forced unwinding frame type(GlusterFS
4.x v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736428
(xid=0x89fee01)<br>
[2019-12-05 22:00:43.790703] W [MSGID: 114031]
[client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk]
0-GL1image-client-1: remote operation failed<br>
[2019-12-05 22:00:43.790774] E [MSGID: 114031]
[client-rpc-fops_v2.c:1393:client4_0_finodelk_cbk]
0-GL1image-client-1: remote operation failed [Transport endpoint
is not connected]<br>
[2019-12-05 22:00:43.790777] E
[rpc-clnt.c:346:saved_frames_unwind] (-->
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890]
))))) 0-GL1image-client-2: forced unwinding frame type(GlusterFS
4.x v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736542
(xid=0x825bffc)<br>
[2019-12-05 22:00:43.790794] W [MSGID: 114029]
[client-rpc-fops_v2.c:4873:client4_0_finodelk]
0-GL1image-client-1: failed to send the fop<br>
[2019-12-05 22:00:43.790806] W [MSGID: 114031]
[client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk]
0-GL1image-client-2: remote operation failed<br>
[2019-12-05 22:00:43.790825] E [MSGID: 114031]
[client-rpc-fops_v2.c:1393:client4_0_finodelk_cbk]
0-GL1image-client-2: remote operation failed [Transport endpoint
is not connected]<br>
[2019-12-05 22:00:43.790842] W [MSGID: 114029]
[client-rpc-fops_v2.c:4873:client4_0_finodelk]
0-GL1image-client-2: failed to send the fop<br>
<br>
The fop/transport-not-connected errors just repeat for another
50 lines or so until 22:00:46, at which point the
volumes appear to be fine (though the VMs were still read-only
until I rebooted).<br>
<br>
[2019-12-05 22:00:46.987242] W
[fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701328:
READ => -1 gfid=d883b7c4-97f5-4f12-9373-7987cfc7dee4
fd=0x7f02f005b708 (Transport endpoint is not connected)<br>
[2019-12-05 22:00:47.029947] W
[fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701329:
READ => -1 gfid=d883b7c4-97f5-4f12-9373-7987cfc7dee4
fd=0x7f02f005b708 (Transport endpoint is not connected)<br>
[2019-12-05 22:00:49.901075] W
[fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701330:
READ => -1 gfid=c342dba6-a2a2-49a8-be3f-cd320e90c956
fd=0x7f02f002bee8 (Transport endpoint is not connected)<br>
[2019-12-05 22:00:49.923525] W
[fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701331:
READ => -1 gfid=c342dba6-a2a2-49a8-be3f-cd320e90c956
fd=0x7f02f002bee8 (Transport endpoint is not connected)<br>
[2019-12-05 22:00:49.970219] W
[fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701332:
READ => -1 gfid=fcec6b7a-ad23-4449-aa09-107e113877a1
fd=0x7f02f008dd58 (Transport endpoint is not connected)<br>
[2019-12-05 22:00:50.023932] W
[fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701333:
READ => -1 gfid=fcec6b7a-ad23-4449-aa09-107e113877a1
fd=0x7f02f008dd58 (Transport endpoint is not connected)<br>
[2019-12-05 22:00:54.807833] I
[rpc-clnt.c:2028:rpc_clnt_reconfig] 0-GL1image-client-2:
changing port to 49153 (from 0)<br>
[2019-12-05 22:00:54.808043] I
[rpc-clnt.c:2028:rpc_clnt_reconfig] 0-GL1image-client-1:
changing port to 49152 (from 0)<br>
[2019-12-05 22:00:46.115076] E [MSGID: 133014]
[shard.c:1799:shard_common_stat_cbk] 0-GL1image-shard: stat
failed: 7a5959d6-75fc-411d-8831-57a744776ed3 [Transport endpoint
is not connected]<br>
[2019-12-05 22:00:54.820394] I [MSGID: 114046]
[client-handshake.c:1106:client_setvolume_cbk]
0-GL1image-client-1: Connected to GL1image-client-1, attached to
remote volume '/GLUSTER/GL1image'.<br>
[2019-12-05 22:00:54.820447] I [MSGID: 114042]
[client-handshake.c:930:client_post_handshake]
0-GL1image-client-1: 10 fds open - Delaying child_up until they
are re-opened<br>
[2019-12-05 22:00:54.820549] I [MSGID: 114046]
[client-handshake.c:1106:client_setvolume_cbk]
0-GL1image-client-2: Connected to GL1image-client-2, attached to
remote volume '/GLUSTER/GL1image'.<br>
[2019-12-05 22:00:54.820568] I [MSGID: 114042]
[client-handshake.c:930:client_post_handshake]
0-GL1image-client-2: 10 fds open - Delaying child_up until they
are re-opened<br>
[2019-12-05 22:00:54.821381] I [MSGID: 114041]
[client-handshake.c:318:client_child_up_reopen_done]
0-GL1image-client-1: last fd open'd/lock-self-heal'd - notifying
CHILD-UP<br>
[2019-12-05 22:00:54.821406] I [MSGID: 108002]
[afr-common.c:5602:afr_notify] 0-GL1image-replicate-0:
Client-quorum is met<br>
[2019-12-05 22:00:54.821446] I [MSGID: 114041]
[client-handshake.c:318:client_child_up_reopen_done]
0-GL1image-client-2: last fd open'd/lock-self-heal'd - notifying
CHILD-UP<br>
<br>
What is odd is that the Gluster logs on #3 and #1 show
absolutely ZERO Gluster errors around that time, nor do I see
any network/teamd errors on any of the 3 nodes (including the
problem node #2).<br>
<br>
I've checked dmesg/syslog and every other log file on the box.<br>
<br>
According to a staff member, this same KVM host had the
same problem about 3 weeks ago. It was written up as a fluke,
possibly due to excess disk I/O, since we have been using
Gluster for years and have rarely seen issues, especially with
very basic Gluster usage.<br>
<br>
In this case those VMs weren't overly busy and now we have a
repeat problem.<br>
<br>
So I am wondering where else I can look to diagnose the problem
or should I abandon the hardware/setup?<br>
<br>
I assume it's a networking issue and not a Gluster one, but I am
confused why Gluster nodes #1 and #3 didn't complain about not
seeing #2. If the networking did drop out, should they have
noticed?<br>
<br>
There also don't appear to be any visible hard disk issues
(smartd is running).<br>
<br>
Side note: I have reset the tcp-timeout back to 42 seconds and
will look at upgrading to 6.6. I also see that the ARB and the
unaffected Gluster node were running Gluster 6.4 (I don't know
why #2 is on 6.5, but I am checking on that as well; we turn off
auto-upgrade).<br>
<br>
Maybe the mismatched versions are the culprit?<br>
<br>
Also, we have a large number of these replica 2+1 Gluster setups
running Gluster versions from 5.x up, and none of the others have
had this issue.<br>
<br>
Any advice would be appreciated.<br>
<br>
Sincerely,<br>
<br>
Wk<br>
<br>
</div>
<br>
</blockquote>
</body>
</html>