[Gluster-devel] discussion about mount slow issue when stop sn-0

Mon Nov 21 03:19:04 UTC 2016

Hi Gluster experts,

Good day!
I am checking one issue related to Glusterfs, and found following issue.

Description of issue:
1>our cluster env has three storage node , sn-0(169.254.0.24)and sn-1(169.254.0.23) is deployed as replicate storage nodes. Sn-2 is quorum node.
2>stop sn-0 node(so sn-0 become unreachable suddenly), and execute mound command: time mount -t glusterfs 169.254.0.24:/log /mnt/test2 -o rw,backupvolfile-server=169.254.0.23
3> this command will not finish until 5 minutes pass.
4> wait some time (10 minutes) , execute this mount command again, the same command will finish within several seconds.

My analysis:
On the second time the mount command could return quickly, while the first time mount hang 5 minutes, I compare the glusterfs log file, and find
1>on the first time the rpc client reconnect keeps trying until following prints:
   [2016-11-17 02:50:23.593340] E [socket.c:2282:socket_connect_finish] 0-log-client-0: connection to 169.254.0.24:24007 failed (Connection timed out)
  [2016-11-17 02:50:23.593360] T [socket.c:738:__socket_disconnect] 0-log-client-0: disconnecting 0xe2a280, state=0 gen=0 sock=12
   [2016-11-17 02:50:23.593370] D [socket.c:719:__socket_shutdown] 0-log-client-0: shutdown() returned -1. Transport endpoint is not connected
   [2016-11-17 02:50:23.593378] D [socket.c:2359:socket_event_handler] 0-transport: disconnecting now
   but the begin time of this retry is:
   [2016-11-17 02:48:20.416900] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-log-client-0: attempting reconnect
   Glusterfs process try a lot of times and the time elapse is around 2 minutes
2>On the second time the rpc client also keeps reconnecting but each time it returns very quickly with following prints:
   [2016-11-17 04:27:06.045700] E [socket.c:2282:socket_connect_finish] 0-log-client-0: connection to 169.254.0.24:24007 failed (No route to host)
   [2016-11-17 04:27:06.045751] T [socket.c:738:__socket_disconnect] 0-log-client-0: disconnecting 0xffc4a0, state=0 gen=0 sock=12
   [2016-11-17 04:27:06.045760] D [socket.c:719:__socket_shutdown] 0-log-client-0: shutdown() returned -1. Transport endpoint is not connected
   [2016-11-17 04:27:06.045766] D [socket.c:2359:socket_event_handler] 0-transport: disconnecting now

Questions:
      Why socket_connect_finish is called much later for the first time mount? I assume it is because in the first mount connect() timeout is slowly and the second time connect() return much faster?
      Is there any way to control the rpc_clnt_reconnect times? Because seems this is a waste, no need to retry so many times.

 Attachment: two times mount trace log of glusterfs.

Best regards,
Cynthia （周琳）
MBB SM HETRAN SW3 MATRIX
Storage
Mobile: +86 (0)18657188311

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-devel/attachments/20161121/d230ced1/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: when nova stop sn-0 mount slow on sn1 and sn2 issue.zip
Type: application/x-zip-compressed
Size: 16667 bytes
Desc: when nova stop sn-0 mount slow on sn1 and sn2 issue.zip
URL: <http://www.gluster.org/pipermail/gluster-devel/attachments/20161121/d230ced1/attachment-0001.bin>