[Gluster-users] Fwd: VM freeze issue on simple gluster setup.

Thu Dec 12 18:38:09 UTC 2019

On 12/12/2019 4:34 AM, Ravishankar N wrote:
>
>
> On 12/12/19 4:01 am, WK wrote:
>>
>> <BUMP> so I can get some sort of resolution on the issue (i.e. is it 
>> hardware, Gluster etc)
>>
>> I guess what I really need to know is
>>
>> 1) Node 2 complains that it can't reach node 1 and node 3. If this 
>> was an OS/hardware networking issue and not internal to Gluster, 
>> then why didn't node 1 and node 3 log error messages complaining 
>> about not reaching node 2?
>>
> [2019-12-05 22:00:43.739804] C 
> [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-2: 
> server 10.255.1.1:49153 has not responded in the last 21 seconds, 
> disconnecting.
> [2019-12-05 22:00:43.757095] C 
> [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-1: 
> server 10.255.1.3:49152 has not responded in the last 21 seconds, 
> disconnecting.
> [2019-12-05 22:00:43.757191] I [MSGID: 114018] 
> [client.c:2323:client_rpc_notify] 0-GL1image-client-2: disconnected 
> from GL1image-client-2. Client process will keep trying to connect to 
> glusterd until brick's port is available
> [2019-12-05 22:00:43.757246] I [MSGID: 114018] 
> [client.c:2323:client_rpc_notify] 0-GL1image-client-1: disconnected 
> from GL1image-client-1. Client process will keep trying to connect to 
> glusterd until brick's port is available
>
> [2019-12-05 22:00:43.757266] W [MSGID: 108001] 
> [afr-common.c:5608:afr_notify] 0-GL1image-replicate-0: Client-quorum 
> is not met
>
> This seems to indicate the mount on node 2 cannot reach 2 bricks. If 
> quorum is not met, you will get ENOTCONN on the mount. Maybe check if 
> the mount is still disconnected from the bricks (either statedump or 
> looking at the .meta folder)?
>
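For anyone wanting to script the .meta check suggested above, here is a 
minimal sketch. It assumes the usual layout of the meta xlator, where 
<mount>/.meta/graphs/active/<volume>-client-N/private contains a line 
such as "connected = 1". The mount point in the usage note and the 
client indexes 0-2 are assumptions; adjust them to the actual setup.

```python
import re
from pathlib import Path


def brick_connected(private_text):
    """Return True/False from a 'connected = N' line, or None if absent."""
    match = re.search(r"^\s*connected\s*=\s*(\d+)\s*$",
                      private_text, re.MULTILINE)
    if match is None:
        return None
    return match.group(1) != "0"


def check_mount(mount_point, volume, client_ids=(0, 1, 2)):
    """Map each client translator name to its connection state.

    The path layout is an assumption about the .meta directory; a value
    of None means the file could not be read or had no 'connected' line.
    """
    states = {}
    for idx in client_ids:
        name = f"{volume}-client-{idx}"
        path = (Path(mount_point) / ".meta" / "graphs" / "active"
                / name / "private")
        try:
            states[name] = brick_connected(path.read_text())
        except OSError:
            states[name] = None  # file missing or mount unreachable
    return states
```

Calling check_mount("/mnt/GL1image", "GL1image") (a hypothetical mount 
point) would report each client translator as True, False or None.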
OK, it is a localhost FUSE mount, if that information helps. 
Should we be mounting on the actual IP of the Gluster network?

The Gluster setup on Node 2 returned to normal 11 seconds later, with the 
mount reconnecting, and everything was fine by the time we were finally 
notified of a problem and investigated (the VM lockup had already occurred).
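The 11 second figure can be reproduced from the two client-quorum 
messages in the log. A minimal sketch, assuming each log entry has been 
joined back onto a single line (the mail client wrapped them) and that 
timestamps follow the format shown in this thread:

```python
import re
from datetime import datetime

STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\]")


def stamp(line):
    """Parse the leading [YYYY-MM-DD HH:MM:SS.ffffff] stamp, if any."""
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")


def quorum_outages(lines):
    """Return (lost_at, restored_at, seconds) for each quorum loss."""
    outages = []
    lost_at = None
    for line in lines:
        when = stamp(line)
        if when is None:
            continue
        if "Client-quorum is not met" in line:
            if lost_at is None:
                lost_at = when
        elif "Client-quorum is met" in line and lost_at is not None:
            outages.append((lost_at, when,
                            (when - lost_at).total_seconds()))
            lost_at = None
    return outages


# The two entries from the Node 2 log, joined onto one line each.
SAMPLE = [
    "[2019-12-05 22:00:43.757266] W [MSGID: 108001] "
    "[afr-common.c:5608:afr_notify] 0-GL1image-replicate-0: "
    "Client-quorum is not met",
    "[2019-12-05 22:00:54.821406] I [MSGID: 108002] "
    "[afr-common.c:5602:afr_notify] 0-GL1image-replicate-0: "
    "Client-quorum is met",
]

for lost, restored, seconds in quorum_outages(SAMPLE):
    print(f"quorum lost {lost} restored {restored} ({seconds:.3f}s)")
```

The same function can be pointed at a whole client log to list every 
quorum loss, which may help show whether the earlier incident on this 
host looked the same.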

I'm not sure about the client port change from 0 back to 49153. Is that 
a clue? Where did port 0 come from?

So is this an OS/Fuse problem with just the Node 2 mount locally 
becoming "confused" and then recovering?

Again, Node 1 and Node 3 were happily reaching Node 2 while this was 
occurring. From their perspective, they never lost their connection to 
Node 2.

[2019-12-05 22:00:54.807833] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 
0-GL1image-client-2: changing port to 49153 (from 0)
[2019-12-05 22:00:54.808043] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 
0-GL1image-client-1: changing port to 49152 (from 0)
[2019-12-05 22:00:54.820394] I [MSGID: 114046] 
[client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-1: 
Connected to GL1image-client-1, attached to remote volume 
'/GLUSTER/GL1image'.
[2019-12-05 22:00:54.820447] I [MSGID: 114042] 
[client-handshake.c:930:client_post_handshake] 0-GL1image-client-1: 10 
fds open - Delaying child_up until they are re-opened
[2019-12-05 22:00:54.820549] I [MSGID: 114046] 
[client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-2: 
Connected to GL1image-client-2, attached to remote volume 
'/GLUSTER/GL1image'.
[2019-12-05 22:00:54.820568] I [MSGID: 114042] 
[client-handshake.c:930:client_post_handshake] 0-GL1image-client-2: 10 
fds open - Delaying child_up until they are re-opened
[2019-12-05 22:00:54.821381] I [MSGID: 114041] 
[client-handshake.c:318:client_child_up_reopen_done] 
0-GL1image-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-12-05 22:00:54.821406] I [MSGID: 108002] 
[afr-common.c:5602:afr_notify] 0-GL1image-replicate-0: Client-quorum is met
[2019-12-05 22:00:54.821446] I [MSGID: 114041] 
[client-handshake.c:318:client_child_up_reopen_done] 
0-GL1image-client-2: last fd open'd/lock-self-heal'd - notifying CHILD-UP

In the meantime, we raised the timeout back to the default of 42 seconds, 
which would have prevented the VM freeze. I suspect there was a reason 
that is the default.
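As a rough check on that, here is a sketch of the arithmetic. It takes 
the call time of the earliest unwound frame in the forwarded log 
(22:00:19.736428) as the start of the stall and the first "Connected to" 
message (22:00:54.820394) as its end. Both choices are assumptions, and 
the comparison is a simplification of how the ping timer really works, 
so treat the result as an upper-bound estimate only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


def would_expire(stall_seconds, ping_timeout):
    """Simplified model: the timer fires if the stall reaches the timeout."""
    return stall_seconds >= ping_timeout


# Assumed stall boundaries, taken from the log entries in this thread.
FIRST_STUCK_CALL = "2019-12-05 22:00:19.736428"
RECONNECTED = "2019-12-05 22:00:54.820394"

stall = seconds_between(FIRST_STUCK_CALL, RECONNECTED)
for timeout in (21, 42):
    print(f"timeout {timeout}s: stall {stall:.3f}s, "
          f"disconnect expected: {would_expire(stall, timeout)}")
```

Under these assumptions the stall is about 35 seconds, which exceeds 
the 21 second setting but stays under the 42 second default.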

-wk

>> -------- Forwarded Message --------
>> Subject: 	VM freeze issue on simple gluster setup.
>> Date: 	Thu, 5 Dec 2019 16:23:35 -0800
>> From: 	WK <wkmail at bneit.com>
>> To: 	Gluster Users <gluster-users at gluster.org>
>>
>>
>>
>> I have a replica2+arbiter setup that is used for VMs.
>>
>> ip #.1 is the arb
>>
>> ip #.2 and #.3 are the kvm hosts.
>>
>> Two volumes are involved and it's Gluster 6.5/Ubuntu 18.04/FUSE. The 
>> Gluster networking uses a two-Ethernet-card teamd/round-robin setup 
>> which *should* have stayed up if one of the ports had failed.
>>
>> I just had a number of VMs go Read-Only due to the below 
>> communication failure at 22:00 but only on kvm host  #2
>>
>> VMs on the same gluster volumes on kvm host 3 were unaffected.
>>
>> The logs on host #2 show the following:
>>
>> [2019-12-05 22:00:43.739804] C 
>> [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 
>> 0-GL1image-client-2: server 10.255.1.1:49153 has not responded in the 
>> last 21 seconds, disconnecting.
>> [2019-12-05 22:00:43.757095] C 
>> [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 
>> 0-GL1image-client-1: server 10.255.1.3:49152 has not responded in the 
>> last 21 seconds, disconnecting.
>> [2019-12-05 22:00:43.757191] I [MSGID: 114018] 
>> [client.c:2323:client_rpc_notify] 0-GL1image-client-2: disconnected 
>> from GL1image-client-2. Client process will keep trying to connect to 
>> glusterd until brick's port is available
>> [2019-12-05 22:00:43.757246] I [MSGID: 114018] 
>> [client.c:2323:client_rpc_notify] 0-GL1image-client-1: disconnected 
>> from GL1image-client-1. Client process will keep trying to connect to 
>> glusterd until brick's port is available
>> [2019-12-05 22:00:43.757266] W [MSGID: 108001] 
>> [afr-common.c:5608:afr_notify] 0-GL1image-replicate-0: Client-quorum 
>> is not met
>> [2019-12-05 22:00:43.790639] E [rpc-clnt.c:346:saved_frames_unwind] 
>> (--> 
>> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59] 
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0] 
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce] 
>> (--> 
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45] 
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890] 
>> ))))) 0-GL1image-client-2: forced unwinding frame type(GlusterFS 4.x 
>> v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736456 (xid=0x825bffb)
>> [2019-12-05 22:00:43.790655] W [MSGID: 114031] 
>> [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 
>> 0-GL1image-client-2: remote operation failed
>> [2019-12-05 22:00:43.790686] E [rpc-clnt.c:346:saved_frames_unwind] 
>> (--> 
>> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59] 
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0] 
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce] 
>> (--> 
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45] 
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890] 
>> ))))) 0-GL1image-client-1: forced unwinding frame type(GlusterFS 4.x 
>> v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736428 (xid=0x89fee01)
>> [2019-12-05 22:00:43.790703] W [MSGID: 114031] 
>> [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 
>> 0-GL1image-client-1: remote operation failed
>> [2019-12-05 22:00:43.790774] E [MSGID: 114031] 
>> [client-rpc-fops_v2.c:1393:client4_0_finodelk_cbk] 
>> 0-GL1image-client-1: remote operation failed [Transport endpoint is 
>> not connected]
>> [2019-12-05 22:00:43.790777] E [rpc-clnt.c:346:saved_frames_unwind] 
>> (--> 
>> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59] 
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0] 
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce] 
>> (--> 
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45] 
>> (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890] 
>> ))))) 0-GL1image-client-2: forced unwinding frame type(GlusterFS 4.x 
>> v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736542 (xid=0x825bffc)
>> [2019-12-05 22:00:43.790794] W [MSGID: 114029] 
>> [client-rpc-fops_v2.c:4873:client4_0_finodelk] 0-GL1image-client-1: 
>> failed to send the fop
>> [2019-12-05 22:00:43.790806] W [MSGID: 114031] 
>> [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 
>> 0-GL1image-client-2: remote operation failed
>> [2019-12-05 22:00:43.790825] E [MSGID: 114031] 
>> [client-rpc-fops_v2.c:1393:client4_0_finodelk_cbk] 
>> 0-GL1image-client-2: remote operation failed [Transport endpoint is 
>> not connected]
>> [2019-12-05 22:00:43.790842] W [MSGID: 114029] 
>> [client-rpc-fops_v2.c:4873:client4_0_finodelk] 0-GL1image-client-2: 
>> failed to send the fop
>>
>> The fop/transport-not-connected errors just repeat for another 50 
>> lines or so until 22:00:46, at which point the volumes appear to be 
>> fine (though the VMs were still read-only until I rebooted).
>>
>> [2019-12-05 22:00:46.987242] W [fuse-bridge.c:2827:fuse_readv_cbk] 
>> 0-glusterfs-fuse: 91701328: READ => -1 
>> gfid=d883b7c4-97f5-4f12-9373-7987cfc7dee4 fd=0x7f02f005b708 
>> (Transport endpoint is not connected)
>> [2019-12-05 22:00:47.029947] W [fuse-bridge.c:2827:fuse_readv_cbk] 
>> 0-glusterfs-fuse: 91701329: READ => -1 
>> gfid=d883b7c4-97f5-4f12-9373-7987cfc7dee4 fd=0x7f02f005b708 
>> (Transport endpoint is not connected)
>> [2019-12-05 22:00:49.901075] W [fuse-bridge.c:2827:fuse_readv_cbk] 
>> 0-glusterfs-fuse: 91701330: READ => -1 
>> gfid=c342dba6-a2a2-49a8-be3f-cd320e90c956 fd=0x7f02f002bee8 
>> (Transport endpoint is not connected)
>> [2019-12-05 22:00:49.923525] W [fuse-bridge.c:2827:fuse_readv_cbk] 
>> 0-glusterfs-fuse: 91701331: READ => -1 
>> gfid=c342dba6-a2a2-49a8-be3f-cd320e90c956 fd=0x7f02f002bee8 
>> (Transport endpoint is not connected)
>> [2019-12-05 22:00:49.970219] W [fuse-bridge.c:2827:fuse_readv_cbk] 
>> 0-glusterfs-fuse: 91701332: READ => -1 
>> gfid=fcec6b7a-ad23-4449-aa09-107e113877a1 fd=0x7f02f008dd58 
>> (Transport endpoint is not connected)
>> [2019-12-05 22:00:50.023932] W [fuse-bridge.c:2827:fuse_readv_cbk] 
>> 0-glusterfs-fuse: 91701333: READ => -1 
>> gfid=fcec6b7a-ad23-4449-aa09-107e113877a1 fd=0x7f02f008dd58 
>> (Transport endpoint is not connected)
>> [2019-12-05 22:00:54.807833] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 
>> 0-GL1image-client-2: changing port to 49153 (from 0)
>> [2019-12-05 22:00:54.808043] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 
>> 0-GL1image-client-1: changing port to 49152 (from 0)
>> [2019-12-05 22:00:46.115076] E [MSGID: 133014] 
>> [shard.c:1799:shard_common_stat_cbk] 0-GL1image-shard: stat failed: 
>> 7a5959d6-75fc-411d-8831-57a744776ed3 [Transport endpoint is not 
>> connected]
>> [2019-12-05 22:00:54.820394] I [MSGID: 114046] 
>> [client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-1: 
>> Connected to GL1image-client-1, attached to remote volume 
>> '/GLUSTER/GL1image'.
>> [2019-12-05 22:00:54.820447] I [MSGID: 114042] 
>> [client-handshake.c:930:client_post_handshake] 0-GL1image-client-1: 
>> 10 fds open - Delaying child_up until they are re-opened
>> [2019-12-05 22:00:54.820549] I [MSGID: 114046] 
>> [client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-2: 
>> Connected to GL1image-client-2, attached to remote volume 
>> '/GLUSTER/GL1image'.
>> [2019-12-05 22:00:54.820568] I [MSGID: 114042] 
>> [client-handshake.c:930:client_post_handshake] 0-GL1image-client-2: 
>> 10 fds open - Delaying child_up until they are re-opened
>> [2019-12-05 22:00:54.821381] I [MSGID: 114041] 
>> [client-handshake.c:318:client_child_up_reopen_done] 
>> 0-GL1image-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
>> [2019-12-05 22:00:54.821406] I [MSGID: 108002] 
>> [afr-common.c:5602:afr_notify] 0-GL1image-replicate-0: Client-quorum 
>> is met
>> [2019-12-05 22:00:54.821446] I [MSGID: 114041] 
>> [client-handshake.c:318:client_child_up_reopen_done] 
>> 0-GL1image-client-2: last fd open'd/lock-self-heal'd - notifying CHILD-UP
>>
>> What is odd is that the Gluster logs on #3 and #1 show absolutely 
>> ZERO Gluster errors around that time, nor do I see any network/teamd 
>> errors on any of the 3 nodes (including the problem node #2).
>>
>> I've checked dmesg/syslog and every other log file on the box.
>>
>> According to a staff member, this same kvm host had the same problem 
>> about 3 weeks ago. It was written up as a fluke, possibly due to 
>> excess disk I/O, since we have been using Gluster for years and 
>> rarely have seen issues, especially with very basic Gluster usage.
>>
>> In this case those VMs weren't overly busy and now we have a repeat 
>> problem.
>>
>> So I am wondering where else I can look to diagnose the problem or 
>> should I abandon the hardware/setup?
>>
>> I assume it's a networking issue and not Gluster, but I am confused 
>> why Gluster nodes #1 and #3 didn't complain about not seeing #2. If 
>> the networking did drop out, shouldn't they have noticed?
>>
>> There also doesn't appear to be any visible hard disk issues (smartd 
>> is running)
>>
>> Side Note: I have reset the tcp-timeout back to 42 seconds and will 
>> look at upgrading to 6.6. I also see that the ARB and the unaffected 
>> Gluster node were running Gluster 6.4 (I don't know why #2 is on 6.5, 
>> but I am checking on that as well; we turn off auto-upgrade).
>>
>> Maybe the mismatched versions are the culprit?
>>
>> Also, we have a large number of these replica 2+1 Gluster setups 
>> running Gluster versions from 5.x up, and none of the others have had 
>> this issue.
>>
>> Any advice would be appreciated.
>>
>> Sincerely,
>>
>> Wk