[Bugs] [Bug 1218731] New: A replicated volume takes too much to come online when one server is down

bugzilla at redhat.com bugzilla at redhat.com
Tue May 5 17:12:29 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1218731

            Bug ID: 1218731
           Summary: A replicated volume takes too much to come online when
                    one server is down
           Product: Red Hat Gluster Storage
           Version: 2.1
         Component: glusterfs
     Sub Component: afr
          Keywords: Triaged
          Severity: high
          Assignee: pkarampu at redhat.com
          Reporter: hamiller at redhat.com
        QA Contact: spandura at redhat.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com,
                    ravishankar at redhat.com, roger.lehmann at marktjagd.de,
                    xhernandez at datalab.es
        Depends On: 1054694
             Group: redhat



+++ This bug was initially created as a clone of Bug #1054694 +++

Description of problem:

When you start a replicated volume with at least one of its bricks managed by
an unreachable server, it can take up to one or two minutes to become ready.

This may be acceptable for FUSE mounts, since the delay only happens at mount
time and the volume then works as expected. However, for other applications
based on libgfapi, where connections and disconnections can happen more often,
this delay can break them.

For example, Proxmox enforces a timeout of 30 seconds to start a virtual
machine (AFAIK this timeout is not configurable). On start, it uses libgfapi to
connect to the virtual disk. If one server is unreachable, the connection takes
too long and the start times out, making it impossible to start machines while
gluster is not healthy. When gluster is running fine, a machine starts in 1 or
2 seconds.
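The same delay can be reproduced with any libgfapi consumer, independently of
Proxmox. A rough sketch (assuming qemu-img is built with the gluster block
driver; the image name vm-100-disk-1.qcow2 is only an illustrative
placeholder):

# time qemu-img info gluster://node2/test/vm-100-disk-1.qcow2

With all servers reachable this returns almost instantly; with one server
powered off it stalls on the TCP connection timeout, far beyond Proxmox's
30-second limit.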

Version-Release number of selected component (if applicable): discovered on
3.4.1, but also happens on master


How reproducible:

always

Steps to Reproduce:
# mkdir -p /gluster/test
# gluster volume create test replica 2 node2:/bricks/test node3:/bricks/test
volume create: test: success: please start the volume to access data
# gluster volume start test
volume start: test: success
# time mount -t glusterfs node2:/test /gluster/test; ls -l /gluster/test

real    0m0.220s
user    0m0.042s
sys    0m0.063s
total 0
# umount /gluster/test 
# ssh node3 poweroff
# time mount -t glusterfs node2:/test /gluster/test; ls -l /gluster/test

real    2m7.479s
user    0m0.036s
sys    0m0.075s
total 0

Actual results:

It takes too long to make the volume accessible.

Expected results:

If at least one brick is online, the volume should become available much sooner.


Additional info:

The root cause seems to be the TCP connection timeout. If gluster is down but
the server is alive, the server immediately replies with "Connection refused"
and the delay disappears. However, if the server is down or its IP is
unreachable, the client has to wait for the full TCP connection timeout.

A first workaround is to lower the value of /proc/sys/net/ipv4/tcp_syn_retries.
This value is typically set to 5 or 6, which means a timeout of about one or
two minutes. A value of 2 or 3 reduces the timeout to 8 or 16 seconds. However,
this could have other consequences depending on the environment.
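
A minimal sketch of applying the workaround (the value 3 is only an example,
and persisting it through /etc/sysctl.conf is an assumption about the
environment, not a project recommendation):

# cat /proc/sys/net/ipv4/tcp_syn_retries      (typically prints 5 or 6)
# echo 3 > /proc/sys/net/ipv4/tcp_syn_retries
# echo 'net.ipv4.tcp_syn_retries = 3' >> /etc/sysctl.conf   (persist across reboots)

Note that this is a system-wide setting and affects every outgoing TCP
connection, not just gluster's.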

As a side note, I would also change the message that AFR logs when it sees the
first subvolume online, as it seems misleading to me. It says "Subvolume
'test-client-0' came back up; going online.", but the client doesn't really go
online until the "switched to graph 0" message appears, because it waits until
it has seen an answer from every subvolume (even if that answer is a timeout).
An example from the log file:

[2014-01-17 09:35:03.648844] I [afr-common.c:3928:afr_notify]
0-test-replicate-0: Subvolume 'test-client-0' came back up; going online.
[2014-01-17 09:35:03.649014] I
[client-handshake.c:450:client_set_lk_version_cbk] 0-test-client-0: Server lk
version = 1
[2014-01-17 09:37:10.921156] E [socket.c:2161:socket_connect_finish]
0-test-client-1: connection to 10.0.0.3:24007 failed (Connection timed out)
[2014-01-17 09:37:10.932407] I [fuse-bridge.c:4956:fuse_graph_setup] 0-fuse:
switched to graph 0
[2014-01-17 09:37:10.932935] I [fuse-bridge.c:3893:fuse_init] 0-glusterfs-fuse:
FUSE inited with protocol versions: glusterfs 7.22 kernel 7.22
[2014-01-17 09:37:10.933990] I
[afr-common.c:2218:afr_set_root_inode_on_first_lookup] 0-test-replicate-0:
added root inode

--- Additional comment from Roger Lehmann on 2015-04-02 08:59:00 EDT ---

+1, I can confirm this bug.
This is a huge problem for automatic failover in Proxmox. Without the
workaround of reducing the tcp_syn_retries count, any HA VM takes longer to
start than Proxmox's 30-second timeout.

Please fix it, thank you.

--- Additional comment from Ravishankar N on 2015-05-02 01:20:06 EDT ---

I'm able to reproduce the issue on a plain 2x1 distribute volume as well.
Mounting a client on the node which is up hangs until the network.ping-timeout
value expires. After changing it from the default 42 seconds to 20 seconds,
even umount seems to hang for that long:
--------------------
[2015-05-02 05:09:53.783067] I
[client-handshake.c:187:client_set_lk_version_cbk] 0-testvol-client-1: Server
lk version = 1
[2015-05-02 05:10:37.735298] C
[rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-testvol-client-1: server
10.70.42.188:49152 has not responded in the last 20 seconds, disconnecting.
[2015-05-02 05:10:37.736622] E [rpc-clnt.c:362:saved_frames_unwind] (-->
/usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x240)[0x7ff771ce0622] (-->
/usr/local/lib/libgfrpc.so.0(saved_frames_unwind+0x212)[0x7ff771aa8f02] (-->
/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7ff771aa8fff] (-->
/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7ff771aa9491]
(--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x147)[0x7ff771aa9e8b] )))))
0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
called at 2015-05-02 05:10:17.019716 (xid=0xa)
[2015-05-02 05:10:37.736795] W [client-rpc-fops.c:2824:client3_3_lookup_cbk]
0-testvol-client-1: remote operation failed: Transport endpoint is not
connected. Path: / (00000000-0000-0000-0000-000000000001)
[2015-05-02 05:10:37.737991] E [rpc-clnt.c:362:saved_frames_unwind] (-->
/usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x240)[0x7ff771ce0622] (-->
/usr/local/lib/libgfrpc.so.0(saved_frames_unwind+0x212)[0x7ff771aa8f02] (-->
/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7ff771aa8fff] (-->
/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7ff771aa9491]
(--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x147)[0x7ff771aa9e8b] )))))
0-testvol-client-1: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at
2015-05-02 05:10:17.019759 (xid=0xb)
[2015-05-02 05:10:37.738092] W [rpc-clnt-ping.c:204:rpc_clnt_ping_cbk]
0-testvol-client-1: socket disconnected
[2015-05-02 05:10:37.738158] I [client.c:2086:client_rpc_notify]
0-testvol-client-1: disconnected from testvol-client-1. Client process will
keep trying to connect to glusterd until brick's port is available
[2015-05-02 05:10:37.741950] I [fuse-bridge.c:4922:fuse_thread_proc] 0-fuse:
unmounting /mnt/fuse_mnt
[2015-05-02 05:10:37.742171] W [glusterfsd.c:1212:cleanup_and_exit] (--> 0-:
received signum (15), shutting down
[2015-05-02 05:10:37.742580] I [fuse-bridge.c:5617:fini] 0-fuse: Unmounting
'/mnt/fuse_mnt'. 
-----------------------------
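
For reference, the ping timeout used above can be changed per volume with the
regular CLI; the 20-second value only mirrors the test and is not a tuning
recommendation:

# gluster volume set testvol network.ping-timeout 20
volume set: success
# gluster volume info testvol | grep ping-timeout
network.ping-timeout: 20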


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1054694
[Bug 1054694] A replicated volume takes too much to come online when one
server is down