[Bugs] [Bug 1450933] New: [New] - Replacing an arbiter brick while I/O happens causes vm pause

bugzilla at redhat.com bugzilla at redhat.com
Mon May 15 12:25:25 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1450933

            Bug ID: 1450933
           Summary: [New] - Replacing an arbiter brick while I/O happens
                    causes vm pause
           Product: GlusterFS
           Version: 3.11
         Component: arbiter
          Assignee: bugs at gluster.org
          Reporter: ravishankar at redhat.com
                CC: bugs at gluster.org
        Depends On: 1443980, 1449610
            Blocks: 1277939 (Gluster-HC-2)



+++ This bug was initially created as a clone of Bug #1449610 +++

+++ This bug was initially created as a clone of Bug #1443980 +++

Description of problem:
While replacing an arbiter brick with I/O in progress, all the VMs go to a
paused state.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-18.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install an HC setup.
2. Create VMs.
3. Start running fio on the VMs.
4. While fio is running, replace the arbiter brick in the volume from the UI.

Actual results:
Replacing the arbiter brick causes the VMs to pause.

Expected results:
VMs should not pause when the arbiter brick is replaced.

Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-04-20
07:24:37 EDT ---

This bug is automatically being proposed for the current release of Red Hat
Gluster Storage 3 under active development, by setting the release flag
'rhgs-3.3.0' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from RamaKasturi on 2017-04-20 07:44:59 EDT ---



--- Additional comment from Ravishankar N on 2017-05-08 08:35:23 EDT ---

Notes:
I was able to re-create the issue with glusterfs-3.8.4-18 on the setup loaned
from Kasturi, but when I tried a more recent source with the following commits
on top of 3.8.4-18, I was no longer able to hit it:
------------------------
* 586b35b - cluster/dht: rm -rf fails if dir has stale linkto files (3 days
ago) <N Balachandran>
* 9a8cc33 - glusterd/geo-rep: Fix snapshot create in geo-rep setup (4 days ago)
<Kotresh HR>
* 6494b76 - features/shard: Initialize local->fop in readv (4 days ago)
<Krutika Dhananjay>
* 4a6bdc7 - features/shard: Fix vm corruption upon fix-layout (4 days ago)
<Krutika Dhananjay>
* f25b41d - nfs: make subdir mounting work for Solaris 10 clients (4 days ago)
<Bipin Kunal>
* 8606249 - glusterd: Fix snapshot failure in non-root geo-rep setup (4 days
ago) <Kotresh HR>
* 37edb3f - rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup (4
weeks ago) <Atin Mukherjee>
* 795a095 - features/shard: Pass the correct iatt for cache invalidation (4
weeks ago) <Krutika Dhananjay>
------------------------
I have requested Kasturi to try the repro on the rhgs-3.3 build, but in the
meantime I am continuing to debug it on 3.8.4-18 to find the RCA.


From the fuse mount logs, I see that *after* the graph switch, fuse_readv_cbk
is failing with ENOTCONN, which should not happen.

[2017-05-04 12:34:20.865336] I [fuse-bridge.c:5241:fuse_graph_setup] 0-fuse:
switched to graph 15

[2017-05-04 12:34:20.894025] W [fuse-bridge.c:2228:fuse_readv_cbk]
0-glusterfs-fuse: 913157: READ => -1 gfid=036fe3b4-27b2-4103-825c-508e87635498
fd=0x7f15a53822c8 (Transport endpoint is not connected)
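The op_errno behind these messages is 107, which on Linux maps to ENOTCONN
("Transport endpoint is not connected"); a quick sanity check in Python
(errno values are platform-specific, so this assumes Linux):

```python
import errno
import os

# On Linux, errno 107 is ENOTCONN -- the "Transport endpoint is not
# connected" error reported by fuse_readv_cbk in the log line above.
print(errno.ENOTCONN)               # 107 on Linux
print(os.strerror(errno.ENOTCONN))  # Transport endpoint is not connected
```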


On attaching gdb to fuse_readv_cbk, I see that mknod FOP has failed with
ENOTCONN:
#0  fuse_readv_cbk (frame=0x7fb1944b3208, cookie=<optimized out>,
this=0x7fb1989b2dc0, op_ret=-1, op_errno=107, vector=0x0, count=0, stbuf=0x0,
iobref=0x0, xdata=0x0) at fuse-bridge.c:2223
#1  0x00007fb18e32102d in io_stats_readv_cbk (frame=0x7fb1944ac9b0,
cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=107,
vector=0x0, count=0, buf=0x0, iobref=0x0, xdata=0x0) at io-stats.c:2035
#2  0x00007fb196a40072 in default_readv_cbk (frame=0x7fb1944af244,
cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=107,
vector=0x0, count=0, stbuf=0x0, iobref=0x0, xdata=0x0) at defaults.c:1105
#3  0x00007fb196a40072 in default_readv_cbk (frame=0x7fb1944b55a4,
cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=107,
vector=0x0, count=0, stbuf=0x0, iobref=0x0, xdata=0x0) at defaults.c:1105
#4  0x00007fb18eb6d0b9 in shard_post_mknod_readv_handler (frame=0x7fb1944a4604,
this=<optimized out>) at shard.c:3225
#5  0x00007fb18eb679d8 in shard_common_mknod_cbk (frame=0x7fb1944a4604,
cookie=<optimized out>, this=0x7fb188217230, op_ret=<optimized out>,
op_errno=<optimized out>, inode=<optimized out>, buf=0x7fb184037468,
preparent=0x7fb1840374d8,
    postparent=0x7fb184037548, xdata=0x0) at shard.c:3270
#6  0x00007fb18eda92a9 in dht_newfile_cbk (frame=0x7fb1944a2e00,
cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=107,
inode=0x7fb17211d418, stbuf=0x7fb184037468, preparent=0x7fb1840374d8,
postparent=0x7fb184037548,
    xdata=0x0) at dht-common.c:5305
#7  0x00007fb18f006b17 in afr_mknod_unwind (frame=<optimized out>,
this=<optimized out>) at afr-dir-write.c:554
#8  0x00007fb18f028f31 in __afr_txn_write_done (frame=0x7fb1944b3b24,
this=0x7fb188215af0) at afr-transaction.c:195
#9  0x00007fb18f02e0ab in afr_unlock_common_cbk
(frame=frame@entry=0x7fb1944b3b24, this=this@entry=0x7fb188215af0, xdata=0x0,
op_errno=<optimized out>, op_ret=<optimized out>, cookie=<optimized out>) at
afr-lk-common.c:633
#10 0x00007fb18f02e9a4 in afr_unlock_entrylk_cbk (frame=0x7fb1944b3b24,
cookie=<optimized out>, this=0x7fb188215af0, op_ret=<optimized out>,
op_errno=<optimized out>, xdata=<optimized out>) at afr-lk-common.c:829
#11 0x00007fb18f291991 in client3_3_entrylk_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7fb1944a3e90) at
client-rpc-fops.c:1657
#12 0x00007fb19678d850 in rpc_clnt_handle_reply
(clnt=clnt@entry=0x7fb188706c70, pollin=pollin@entry=0x7fb1800a4a00) at
rpc-clnt.c:792
#13 0x00007fb19678db3f in rpc_clnt_notify (trans=<optimized out>,
mydata=0x7fb188706ca0, event=<optimized out>, data=0x7fb1800a4a00) at
rpc-clnt.c:985
#14 0x00007fb1967899f3 in rpc_transport_notify (this=this@entry=0x7fb188760180,
event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fb1800a4a00)
at rpc-transport.c:538
#15 0x00007fb191577314 in socket_event_poll_in (this=this@entry=0x7fb188760180)
at socket.c:2272
#16 0x00007fb1915797c5 in socket_event_handler (fd=<optimized out>, idx=5,
data=0x7fb188760180, poll_in=1, poll_out=0, poll_err=0) at socket.c:2402
#17 0x00007fb196a1d770 in event_dispatch_epoll_handler (event=0x7fb18ce83e80,
event_pool=0x7fb1989a4f00) at event-epoll.c:571
#18 event_dispatch_epoll_worker (data=0x7fb18803a330) at event-epoll.c:674
#19 0x00007fb195824dc5 in start_thread (arg=0x7fb18ce84700) at
pthread_create.c:308
#20 0x00007fb19516973d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:113


Continuing to investigate why mknod failed.

--- Additional comment from Ravishankar N on 2017-05-08 09:36:32 EDT ---

More from the mount logs:
[2017-05-04 12:34:20.893105] D [MSGID: 0]
[dht-common.c:5380:dht_mknod_wind_to_avail_subvol] 15-data-dht: creating
/.shard/036fe3b4-27b2-4103-825c-508e87635498.2446 on data-replicate-0

[2017-05-04 12:34:20.893533] D [MSGID: 114031]
[client-rpc-fops.c:237:client3_3_mknod_cbk] 15-data-client-0: remote operation
failed. Path: /.shard/036fe3b4-27b2-4103-825c-508e87635498.2446 [File exists]

[2017-05-04 12:34:20.893575] D [MSGID: 114031]
[client-rpc-fops.c:237:client3_3_mknod_cbk] 15-data-client-1: remote operation
failed. Path: /.shard/036fe3b4-27b2-4103-825c-508e87635498.2446 [File exists]


It looks like the shard xlator sends mknod instead of lookup. Since the shard
file is already present on the 1st and 2nd bricks, we get EEXIST from them. On
the 3rd (arbiter) brick, which was just replaced, the mknod succeeds. In
arbiter volumes, AFR sends ENOTCONN if the FOP did not meet quorum, so the
mknod fails with ENOTCONN.

Shard, I think, is tuned to ignore only EEXIST errors, but AFR sends ENOTCONN:
[2017-05-04 12:34:20.893975] D [MSGID: 0] [shard.c:3260:shard_common_mknod_cbk]
15-data-shard: mknod of shard 2446 failed: Transport endpoint is not connected

So shard propagates the error up to FUSE.
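As a toy sketch of the decision just described (Python; hypothetical function
and return values, not GlusterFS source): shard tolerates EEXIST from the
MKNOD because the shard file may legitimately already exist, while any other
errno, such as AFR's ENOTCONN, fails the fop and bubbles up to FUSE:

```python
import errno

def shard_mknod_cbk(op_ret, op_errno):
    """Toy model of the shard mknod callback's error handling (simplified).

    EEXIST means the shard file is already present, which is fine when
    the mknod was issued on behalf of a read; anything else aborts the
    fop, and the error is unwound all the way up to FUSE.
    """
    if op_ret == 0 or op_errno == errno.EEXIST:
        return "proceed"
    return ("fail", op_errno)

print(shard_mknod_cbk(-1, errno.EEXIST))    # shard exists: carry on reading
print(shard_mknod_cbk(-1, errno.ENOTCONN))  # propagated to FUSE, VM pauses
```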
I think the reason it succeeded with the patches mentioned in comment #3 is
commit `4a6bdc7 - features/shard: Fix vm corruption upon fix-layout`, which
"changes shard fops to do a LOOKUP before the MKNOD", thereby triggering name
heal in AFR.
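To make the proposed errno fix concrete, here is a toy model (Python;
hypothetical function, heavily simplified from the actual AFR code) of the
arbitration: the pre-fix behaviour returns a blanket ENOTCONN when quorum is
not met, while the fix propagates the errno the failing bricks actually
reported, letting shard see the EEXIST it knows how to ignore:

```python
import errno

def arbitrate(results, fixed=True):
    """Toy model of AFR's fop-cbk arbitration on a replica 3 (2+1 arbiter).

    `results` holds one (op_ret, op_errno) pair per brick. If the fop did
    not succeed on a quorum of bricks, the old code returned ENOTCONN
    unconditionally; the fixed code surfaces a real per-brick errno.
    """
    successes = sum(1 for op_ret, _ in results if op_ret == 0)
    quorum = len(results) // 2 + 1  # simple majority
    if successes >= quorum:
        return (0, 0)
    if not fixed:
        return (-1, errno.ENOTCONN)  # old behaviour: blanket ENOTCONN
    # fixed behaviour: report what the failing bricks actually said
    real_errno = next(e for op_ret, e in results if op_ret != 0)
    return (-1, real_errno)

# The scenario from the logs: MKNOD gets EEXIST on bricks 1 and 2 and
# succeeds only on the freshly replaced arbiter, so quorum is not met.
case = [(-1, errno.EEXIST), (-1, errno.EEXIST), (0, 0)]
print(arbitrate(case, fixed=False))  # (-1, ENOTCONN): shard fails the readv
print(arbitrate(case, fixed=True))   # (-1, EEXIST): shard can ignore this
```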

I will need to confirm this analysis. FWIW, we need to consider a part of the
changes in https://review.gluster.org/#/c/14769/ which propagates the
appropriate errno instead of ENOTCONN (see commit message).

--- Additional comment from Worker Ant on 2017-05-10 07:24:56 EDT ---

REVIEW: https://review.gluster.org/17235 (afr: set correct errno in
afr_txn_arbitrate_fop_cbk()) posted (#1) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2017-05-12 07:41:27 EDT ---

REVIEW: https://review.gluster.org/17235 (afr: propagate correct errno for fop
failures in arbiter) posted (#2) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2017-05-15 02:48:01 EDT ---

REVIEW: https://review.gluster.org/17235 (afr: propagate correct errno for fop
failures in arbiter) posted (#3) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2017-05-15 08:18:36 EDT ---

COMMIT: https://review.gluster.org/17235 committed in master by Jeff Darcy
(jeff at pl.atyp.us) 
------
commit 93c850dd2a513fab75408df9634ad3c970a0e859
Author: Ravishankar N <ravishankar at redhat.com>
Date:   Wed May 10 10:03:08 2017 +0530

    afr: propagate correct errno for fop failures in arbiter

    Problem:
    If quorum is not met in fop cbk, arbiter sends an ENOTCONN error to the
    upper xlators. In a VM workload with sharding enabled, this was leading
    to the VM pausing when replace-brick was performed as described in the BZ.

    Fix:
    Move the fop cbk arbitration logic to afr_handle_quorum() because in
    normal replica volumes, that is the function that has the quorum and
    errno checks in the fop cbk path before doing a post-op.

    Thanks to Pranith for suggesting this approach.

    Change-Id: Ie6315db30c5e36326b71b90a01da824109e86796
    BUG: 1449610
    Signed-off-by: Ravishankar N <ravishankar at redhat.com>
    Reviewed-on: https://review.gluster.org/17235
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1277939
[Bug 1277939] (Gluster-HC-2) [TRACKER] Gluster Hyperconvergence - MVP
https://bugzilla.redhat.com/show_bug.cgi?id=1443980
[Bug 1443980] [New] - Replacing an arbiter brick while I/O happens causes
vm pause
https://bugzilla.redhat.com/show_bug.cgi?id=1449610
[Bug 1449610] [New] - Replacing an arbiter brick while I/O happens causes
vm pause
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

