[Bugs] [Bug 1330855] New: A replicated volume takes too much to come online when one server is down

bugzilla at redhat.com bugzilla at redhat.com
Wed Apr 27 07:07:31 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1330855

            Bug ID: 1330855
           Summary: A replicated volume takes too much to come online when
                    one server is down
           Product: GlusterFS
           Version: 3.7.11
         Component: replicate
          Keywords: Triaged
          Assignee: bugs at gluster.org
          Reporter: ravishankar at redhat.com
                CC: bugs at gluster.org, roger.lehmann at marktjagd.de,
                    xhernandez at datalab.es
        Depends On: 1054694
            Blocks: 1286911, 1218731



+++ This bug was initially created as a clone of Bug #1054694 +++

Description of problem:

When you start a replicated volume with at least one of its bricks managed by
an unreachable server, it can take one to two minutes to become ready.

This may be acceptable for FUSE mounts, since the delay only happens at mount
time and the volume then works as expected. However, for applications based on
libgfapi, where connections and disconnections can happen more often, this
delay can break them.

For example, Proxmox enforces a timeout of 30 seconds when starting a virtual
machine (AFAIK this timeout is not configurable). On start, it uses libgfapi
to connect to the virtual disk. If one server is unreachable, the connection
takes too long, the start times out, and it becomes impossible to start
machines while gluster is not healthy. When gluster is running fine, a machine
starts in 1 or 2 seconds.
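
For context, the connection sequence such a libgfapi application goes through
looks roughly like the sketch below (the volume and host names are taken from
the reproduction steps further down, purely for illustration). glfs_init() is
the call that blocks until the client has heard from, or timed out on, every
brick, so this is where the whole delay lands:
--------------------
/* Minimal libgfapi client sketch; build with -lgfapi. */
#include <stdio.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    glfs_t *fs = glfs_new("test");      /* volume name */
    if (!fs)
        return 1;

    glfs_set_volfile_server(fs, "tcp", "node2", 24007);

    /* glfs_init() fetches the volfile and connects to the bricks; with
     * one brick server unreachable, this is the call that can block for
     * the full TCP connect timeout. */
    if (glfs_init(fs) != 0) {
        perror("glfs_init");
        glfs_fini(fs);
        return 1;
    }

    printf("volume is up\n");
    glfs_fini(fs);
    return 0;
}
--------------------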

Version-Release number of selected component (if applicable): discovered on
3.4.1, but also happens on master


How reproducible:

always

Steps to Reproduce:
# mkdir -p /gluster/test
# gluster volume create test replica 2 node2:/bricks/test node3:/bricks/test
volume create: test: success: please start the volume to access data
# gluster volume start test
volume start: test: success
# time mount -t glusterfs node2:/test /gluster/test; ls -l /gluster/test

real    0m0.220s
user    0m0.042s
sys    0m0.063s
total 0
# umount /gluster/test 
# ssh node3 poweroff
# time mount -t glusterfs node2:/test /gluster/test; ls -l /gluster/test

real    2m7.479s
user    0m0.036s
sys    0m0.075s
total 0

Actual results:

It takes too long to make the volume accessible.

Expected results:

If at least one brick is online, it should make the volume available sooner.


Additional info:

The root cause appears to be the TCP connection timeout. If gluster is down
but the server is alive, the server immediately replies with "Connection
refused" and the delay disappears. However, if the server is down or its IP is
unreachable, the client has to wait for the kernel to give up on the
connection attempt. This matches the measurement above: with the Linux default
of 6 SYN retries, retransmissions go out after 1, 2, 4, 8, 16 and 32 seconds
and the kernel gives up 64 seconds after the last one, i.e. after 127 seconds
in total, which is almost exactly the 2m7s the mount took.
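
One way a client could avoid depending on the kernel's SYN-retry schedule is
to perform the connect in non-blocking mode and bound it with poll(). The
following is only an illustrative sketch of that technique, independent of
gluster's own transport code:
--------------------
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>

/* Connect with an explicit upper bound instead of waiting for the
 * kernel's SYN retries to run out. Returns 0 on success, -1 on error
 * or timeout. */
int connect_with_timeout(int fd, const struct sockaddr *addr,
                         socklen_t len, int timeout_ms)
{
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);

    if (connect(fd, addr, len) == 0)
        return 0;                   /* connected immediately */
    if (errno != EINPROGRESS)
        return -1;                  /* immediate failure */

    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;                  /* timed out: the unreachable case */

    /* A refused connection completes quickly too; check the result. */
    int err = 0;
    socklen_t errlen = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen);
    return err == 0 ? 0 : -1;
}
--------------------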

A first workaround is to lower the value of /proc/sys/net/ipv4/tcp_syn_retries.
This value is typically set to 5 or 6, which means a timeout of about one or
two minutes. A value of 2 or 3 reduces the timeout to about 8 or 16 seconds.
However, this is a system-wide setting and could have other consequences
depending on the environment.
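
On Linux the same knob also exists per socket as the TCP_SYNCNT option, which
would let a client bound its own SYN retries without touching the global
sysctl. A minimal sketch of that idea (Linux-specific and illustrative only;
AFAIK gluster does not expose such an option):
--------------------
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Bound the SYN retries for one socket only (Linux-specific). */
int make_bounded_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* With 3 retries, connect() gives up after roughly
     * 1 + 2 + 4 + 8 = 15 seconds instead of ~2 minutes. */
    int syn_retries = 3;
    setsockopt(fd, IPPROTO_TCP, TCP_SYNCNT, &syn_retries,
               sizeof(syn_retries));
    return fd;
}
--------------------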

As a side note, I would also change the message that AFR logs when it sees the
first subvolume online, as it seems misleading to me. It says "Subvolume
'test-client-0' came back up; going online.", but the volume doesn't really go
online until the "switched to graph 0" message appears, because AFR waits
until it has received an answer from every subvolume (even if that answer is a
timeout). An example from the log file:

[2014-01-17 09:35:03.648844] I [afr-common.c:3928:afr_notify]
0-test-replicate-0: Subvolume 'test-client-0' came back up; going online.
[2014-01-17 09:35:03.649014] I
[client-handshake.c:450:client_set_lk_version_cbk] 0-test-client-0: Server lk
version = 1
[2014-01-17 09:37:10.921156] E [socket.c:2161:socket_connect_finish]
0-test-client-1: connection to 10.0.0.3:24007 failed (Connection timed out)
[2014-01-17 09:37:10.932407] I [fuse-bridge.c:4956:fuse_graph_setup] 0-fuse:
switched to graph 0
[2014-01-17 09:37:10.932935] I [fuse-bridge.c:3893:fuse_init] 0-glusterfs-fuse:
FUSE inited with protocol versions: glusterfs 7.22 kernel 7.22
[2014-01-17 09:37:10.933990] I
[afr-common.c:2218:afr_set_root_inode_on_first_lookup] 0-test-replicate-0:
added root inode

--- Additional comment from Roger Lehmann on 2015-04-02 08:59:00 EDT ---

+1, I can confirm this bug.
This is a huge problem for automatic failover in Proxmox. Without the
workaround of reducing the tcp_syn_retries count, any HA VM takes longer to
start than Proxmox's 30-second timeout allows.

Please fix it, thank you.

--- Additional comment from Ravishankar N on 2015-05-02 01:20:06 EDT ---

I'm able to reproduce the issue on a plain 2x1 distribute volume as well.
Mounting a client on the node that is up hangs until the network.ping-timeout
value expires. After changing it from the default of 42 seconds to 20 seconds,
even umount hangs for that long:
--------------------
[2015-05-02 05:09:53.783067] I
[client-handshake.c:187:client_set_lk_version_cbk] 0-testvol-client-1: Server
lk version = 1
[2015-05-02 05:10:37.735298] C
[rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-testvol-client-1: server
10.70.42.188:49152 has not responded in the last 20 seconds, disconnecting.
[2015-05-02 05:10:37.736622] E [rpc-clnt.c:362:saved_frames_unwind] (-->
/usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x240)[0x7ff771ce0622] (-->
/usr/local/lib/libgfrpc.so.0(saved_frames_unwind+0x212)[0x7ff771aa8f02] (-->
/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7ff771aa8fff] (-->
/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7ff771aa9491]
(--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x147)[0x7ff771aa9e8b] )))))
0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
called at 2015-05-02 05:10:17.019716 (xid=0xa)
[2015-05-02 05:10:37.736795] W [client-rpc-fops.c:2824:client3_3_lookup_cbk]
0-testvol-client-1: remote operation failed: Transport endpoint is not
connected. Path: / (00000000-0000-0000-0000-000000000001)
[2015-05-02 05:10:37.737991] E [rpc-clnt.c:362:saved_frames_unwind] (-->
/usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x240)[0x7ff771ce0622] (-->
/usr/local/lib/libgfrpc.so.0(saved_frames_unwind+0x212)[0x7ff771aa8f02] (-->
/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7ff771aa8fff] (-->
/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7ff771aa9491]
(--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x147)[0x7ff771aa9e8b] )))))
0-testvol-client-1: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at
2015-05-02 05:10:17.019759 (xid=0xb)
[2015-05-02 05:10:37.738092] W [rpc-clnt-ping.c:204:rpc_clnt_ping_cbk]
0-testvol-client-1: socket disconnected
[2015-05-02 05:10:37.738158] I [client.c:2086:client_rpc_notify]
0-testvol-client-1: disconnected from testvol-client-1. Client process will
keep trying to connect to glusterd until brick's port is available
[2015-05-02 05:10:37.741950] I [fuse-bridge.c:4922:fuse_thread_proc] 0-fuse:
unmounting /mnt/fuse_mnt
[2015-05-02 05:10:37.742171] W [glusterfsd.c:1212:cleanup_and_exit] (--> 0-:
received signum (15), shutting down
[2015-05-02 05:10:37.742580] I [fuse-bridge.c:5617:fini] 0-fuse: Unmounting
'/mnt/fuse_mnt'. 
-----------------------------

--- Additional comment from Anand Avati on 2015-06-07 08:36:12 EDT ---

REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after
timeout) posted (#1) for review on master by Ravishankar N
(ravishankar at redhat.com)
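
The patch title describes the approach: instead of blocking until every
subvolume has answered, propagate the child-up event after a timeout once at
least one child is up. A rough sketch of that shape follows; it uses plain
pthreads purely for illustration (the actual patch works inside AFR's notify
path and gluster's own timer machinery, and the timeout value here is made
up):
--------------------
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

struct afr_state {
    pthread_mutex_t lock;
    int children_up;
    int children_heard_from;    /* up, down, or timed out */
    int children_total;
    bool went_online;
};

/* Called with s->lock held. */
static void go_online(struct afr_state *s)
{
    if (!s->went_online && s->children_up > 0) {
        s->went_online = true;
        /* ...propagate CHILD_UP to the parent translator... */
    }
}

static void *up_timer(void *arg)
{
    struct afr_state *s = arg;
    sleep(10);                          /* made-up timeout value */
    pthread_mutex_lock(&s->lock);
    go_online(s);                       /* stop waiting for the rest */
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

void on_child_up(struct afr_state *s)
{
    pthread_mutex_lock(&s->lock);
    s->children_up++;
    s->children_heard_from++;
    if (s->children_up == 1) {          /* first child up: arm the timer */
        pthread_t t;
        pthread_create(&t, NULL, up_timer, s);
        pthread_detach(t);
    }
    if (s->children_heard_from == s->children_total)
        go_online(s);                   /* everyone answered: no wait */
    pthread_mutex_unlock(&s->lock);
}
--------------------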

--- Additional comment from Anand Avati on 2015-06-09 09:04:20 EDT ---

REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after
timeout) posted (#2) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Anand Avati on 2015-06-11 07:38:02 EDT ---

REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after
timeout) posted (#4) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Anand Avati on 2015-06-11 09:23:02 EDT ---

REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after
timeout) posted (#5) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Vijay Bellur on 2015-12-23 04:39:58 EST ---

REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after
timeout) posted (#7) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Vijay Bellur on 2016-01-06 07:00:18 EST ---

REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after
timeout) posted (#8) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Vijay Bellur on 2016-02-08 07:43:59 EST ---

REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after
timeout) posted (#9) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Vijay Bellur on 2016-02-10 11:42:07 EST ---

REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after
timeout) posted (#10) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Mike McCune on 2016-03-28 19:23:30 EDT ---

This bug was accidentally moved from POST to MODIFIED by an error in
automation; please contact mmccune at redhat.com with any questions.

--- Additional comment from Vijay Bellur on 2016-04-10 03:49:39 EDT ---

REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after
timeout) posted (#11) for review on master by Pranith Kumar Karampuri
(pkarampu at redhat.com)

--- Additional comment from Vijay Bellur on 2016-04-25 09:07:34 EDT ---

REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after
timeout) posted (#12) for review on master by Ravishankar N
(ravishankar at redhat.com)


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1054694
[Bug 1054694] A replicated volume takes too much to come online when one
server is down
https://bugzilla.redhat.com/show_bug.cgi?id=1218731
[Bug 1218731] A replicated volume takes too much to come online when one
server is down