[Bugs] [Bug 1426508] New: Application VM paused after add brick operation and VM didn't come up after power cycle.

bugzilla at redhat.com bugzilla at redhat.com
Fri Feb 24 05:25:45 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1426508

            Bug ID: 1426508
           Summary: Application VM paused after add brick operation and VM
                    didn't come up after power cycle.
           Product: GlusterFS
           Version: 3.10
         Component: sharding
          Keywords: Triaged
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: kdhananj at redhat.com
        QA Contact: bugs at gluster.org
                CC: amukherj at redhat.com, bsrirama at redhat.com,
                    bugs at gluster.org, csaba at redhat.com,
                    kdhananj at redhat.com, ravishankar at redhat.com,
                    rcyriac at redhat.com, rgowdapp at redhat.com,
                    rhs-bugs at redhat.com, sabose at redhat.com,
                    sasundar at redhat.com, storage-qa-internal at redhat.com
            Blocks: 1387878



+++ This bug was initially created as a clone of Bug #1420623 +++

+++ This bug was initially created as a clone of Bug #1412554 +++

Description of problem:
=======================
Application VM paused after an add-brick operation. I had 3 app VMs running on
a gluster volume (1 x 3) and converted it to a 2 x 3 volume; at the same time I
added a new RHGS node to the cluster (cluster expansion). During this, one of
the app VMs (vm1) went into the paused state.


We are seeing disconnection messages and "client quorum is not met" error
messages:

[2017-01-12 06:05:13.382644] I [socket.c:3439:socket_submit_request]
27-Dis-Rep1-client-9: not connected (priv->connected = -1)
[2017-01-12 06:05:13.382658] W [rpc-clnt.c:1657:rpc_clnt_submit]
27-Dis-Rep1-client-9: failed to submit rpc-request (XID: 0xdba6 Program:
GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (Dis-Rep1-client-9)
[2017-01-12 06:05:13.382660] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 27-Dis-Rep1-client-11: disconnected from
Dis-Rep1-client-11. Client process will keep trying to connect to glusterd
until brick's port is available
[2017-01-12 06:05:13.382675] E [MSGID: 114031]
[client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-9: remote
operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:13.382692] W [MSGID: 108001] [afr-common.c:4768:afr_notify]
27-Dis-Rep1-replicate-0: Client-quorum is not met
[2017-01-12 06:05:13.382712] E [MSGID: 114031]
[client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-10: remote
operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:13.382731] I [socket.c:3439:socket_submit_request]
27-Dis-Rep1-client-11: not connected (priv->connected = -1)
[2017-01-12 06:05:13.382738] W [rpc-clnt.c:1657:rpc_clnt_submit]
27-Dis-Rep1-client-11: failed to submit rpc-request (XID: 0xdca1 Program:
GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (Dis-Rep1-client-11)
[2017-01-12 06:05:13.382745] E [MSGID: 114031]
[client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-11: remote
operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:13.382826] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 27-Dis-Rep1-client-9: disconnected from
Dis-Rep1-client-9. Client process will keep trying to connect to glusterd until
brick's port is available
[2017-01-12 06:05:13.382851] E [MSGID: 108006] [afr-common.c:4679:afr_notify]
27-Dis-Rep1-replicate-0: All subvolumes are down. Going offline until atleast
one of them comes back up.
[2017-01-12 06:05:13.407637] E [MSGID: 133007]
[shard.c:3238:shard_common_resume_mknod] 29-Dis-Rep1-shard: Inode path failed on
a15a813e-f512-4f96-858f-91b4fad86d4d.33, base file gfid =
a15a813e-f512-4f96-858f-91b4fad86d4d
[2017-01-12 06:05:13.407692] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 4126899: WRITE => -1
gfid=a15a813e-f512-4f96-858f-91b4fad86d4d fd=0x7fd489699014 (Cannot allocate
memory)
[2017-01-12 06:05:15.043730] W [MSGID: 114031]
[client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-9: remote
operation failed
[2017-01-12 06:05:15.043791] W [MSGID: 114031]
[client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-10: remote
operation failed
[2017-01-12 06:05:15.043813] W [MSGID: 114031]
[client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-11: remote
operation failed
[2017-01-12 06:05:15.043828] E [MSGID: 114031]
[client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-9: remote
operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043840] E [MSGID: 114031]
[client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-10: remote
operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043853] E [MSGID: 114031]
[client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-11: remote
operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043897] W [MSGID: 114031]
[client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-9: remote
operation failed
[2017-01-12 06:05:15.043911] W [MSGID: 114031]
[client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-10: remote
operation failed


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-11

How reproducible:
=================
One time


Steps to Reproduce:
===================
1. Have an RHV-RHGS setup with 3 RHGS nodes and a 2-host virt cluster.
2. Create a 2 x 3 volume and create the data domain using this volume.
3. Create two app VMs (say vm1, vm2) using the data domain created in step 2.
4. Reduce the volume to 1 x 3.
5. Create one more app VM (say vm3).
6. Add one more RHGS node to the existing cluster (cluster expansion).
7. While step 6 is in progress, expand the volume back to 2 x 3 by adding one
more replica pair.

After step 7 completed, one of the app VMs (vm1) went into the paused state.

Actual results:
===============
Application VM paused after the add-brick operation.

Expected results:
=================
The VM should not go into a paused state.


--- Additional comment from Byreddy on 2017-01-12 04:48:34 EST ---

volume status and info:
========================
[root at dhcp42-35 ~]# gluster volume status
Status of volume: Dis-Rep1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick server1.com:/rhs/brick3/a1            49153     0          Y       32634
Brick server2.com:/rhs/brick3/a2            49153     0          Y       15405
Brick server3.com:/rhs/brick3/a3            49154     0          Y       30612
Brick server1.com:/rhs/brick1/z1            49152     0          Y       2278 
Brick server2.com:/rhs/brick1/z2            49152     0          Y       7560 
Brick server3.com:/rhs/brick1/z3            49152     0          Y       32519
Self-heal Daemon on localhost               N/A       N/A        Y       2324 
Self-heal Daemon on server1.com             N/A       N/A        Y       13501
Self-heal Daemon on server2.com             N/A       N/A        Y       7675 
Self-heal Daemon on server3.com             N/A       N/A        Y       32562

Task Status of Volume Dis-Rep1
------------------------------------------------------------------------------
There are no active volume tasks

[root at localhost ~]# 
[root at localhost ~]# gluster volume info

Volume Name: Dis-Rep1
Type: Distributed-Replicate
Volume ID: 3c0a32d2-d9fd-4647-80cf-76d663fe7e90
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: server1.com:/rhs/brick3/a1
Brick2: server2.com:/rhs/brick3/a2
Brick3: server3.com:/rhs/brick3/a3
Brick4: server1.com:/rhs/brick1/z1
Brick5: server2.com:/rhs/brick1/z2
Brick6: server3.com:/rhs/brick1/z3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.use-compound-fops: on
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
user.cifs: off
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
features.shard: on
cluster.data-self-heal-algorithm: full
storage.owner-uid: 36
storage.owner-gid: 36
server.allow-insecure: on
network.ping-timeout: 30
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
performance.strict-o-direct: on
[root at dhcp42-35 ~]#


--- Additional comment from Ravishankar N on 2017-01-12 05:04:20 EST ---

Looked at the mount log; moving it to rpc after discussing with Rajesh.

1. After add brick was performed, the mount connected to the old replicate
(replicate-0):

[2017-01-12 06:05:13.324064] I [MSGID: 114046]
[client-handshake.c:1222:client_setvolume_cbk] 29-Dis-Rep1-client-9: Connected
to Dis-Rep1-client-9, attached to remote volume '/rhs/brick3/a1'

[2017-01-12 06:05:13.329514] I [MSGID: 114046]
[client-handshake.c:1222:client_setvolume_cbk] 29-Dis-Rep1-client-10: Connected
to Dis-Rep1-client-10, attached to remote volume '/rhs/brick3/a2'

[2017-01-12 06:05:13.334979] I [MSGID: 114046]
[client-handshake.c:1222:client_setvolume_cbk] 29-Dis-Rep1-client-11: Connected
to Dis-Rep1-client-11, attached to remote volume '/rhs/brick3/a3'.


2. We have a graph switch
[2017-01-12 06:05:13.357471] I [fuse-bridge.c:5241:fuse_graph_setup] 0-fuse:
switched to graph 29

3. Old connections are torn down, which implies all pending I/O on old graph is
over:

[2017-01-12 06:05:13.373073] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7fd50e69a06c) to newfd (0x7fd489698efc)
(inode-gfid:dabc7928-44ce-4dd8-999a-0d5d648da532)(old-subvolume:meta-autoload-27
new-subvolume:meta-autoload-29)
[2017-01-12 06:05:13.376778] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7fd5110f1bec) to newfd (0x7fd489698f88)
(inode-gfid:13136355-721d-4fe7-bb76-0a1d35fdb8e0)(old-subvolume:meta-autoload-27
new-subvolume:meta-autoload-29)
[2017-01-12 06:05:13.380813] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7fd504e57308) to newfd (0x7fd489699014)
(inode-gfid:a15a813e-f512-4f96-858f-91b4fad86d4d)(old-subvolume:meta-autoload-27
new-subvolume:meta-autoload-29)
[2017-01-12 06:05:13.382513] I [MSGID: 114021] [client.c:2365:notify]
27-Dis-Rep1-client-9: current graph is no longer active, destroying rpc_client
[2017-01-12 06:05:13.382570] I [MSGID: 114021] [client.c:2365:notify]
27-Dis-Rep1-client-10: current graph is no longer active, destroying rpc_client
[2017-01-12 06:05:13.382596] I [MSGID: 114021] [client.c:2365:notify]
27-Dis-Rep1-client-11: current graph is no longer active, destroying rpc_client
[2017-01-12 06:05:13.382615] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 27-Dis-Rep1-client-10: disconnected from
Dis-Rep1-client-10. Client process will keep trying to connect to glusterd
until brick's port is available
[2017-01-12 06:05:13.382660] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 27-Dis-Rep1-client-11: disconnected from
Dis-Rep1-client-11. Client process will keep trying to connect to glusterd
until brick's port is available
[2017-01-12 06:05:13.382826] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 27-Dis-Rep1-client-9: disconnected from
Dis-Rep1-client-9. Client process will keep trying to connect to glusterd until
brick's port is available

[2017-01-12 06:05:13.382851] E [MSGID: 108006] [afr-common.c:4679:afr_notify]
27-Dis-Rep1-replicate-0: All subvolumes are down. Going offline until atleast
one of them comes back up.



4. But despite this, all further fops are failing with ENOTCONN. This should
not happen, because all I/O should happen on the new graph and the connections
to the children of replicate-0 are valid as per the logs in step 1.

[2017-01-12 06:05:13.407692] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 4126899: WRITE => -1
gfid=a15a813e-f512-4f96-858f-91b4fad86d4d fd=0x7fd489699014 (Cannot allocate
memory)
[2017-01-12 06:05:15.043730] W [MSGID: 114031]
[client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-9: remote
operation failed
[2017-01-12 06:05:15.043791] W [MSGID: 114031]
[client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-10: remote
operation failed
[2017-01-12 06:05:15.043813] W [MSGID: 114031]
[client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-11: remote
operation failed
[2017-01-12 06:05:15.043828] E [MSGID: 114031]
[client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-9: remote
operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043840] E [MSGID: 114031]
[client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-10: remote
operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043853] E [MSGID: 114031]
[client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-11: remote
operation failed [Transport endpoint is not connected]


--- Additional comment from Byreddy on 2017-01-12 06:28:37 EST ---

I power-cycled the paused VM but it didn't come up; I see VM corruption
messages in the VM console.

Attached a screenshot of it.

--- Additional comment from SATHEESARAN on 2017-01-13 05:28:37 EST ---

(In reply to Byreddy from comment #6)
> Created attachment 1239855 [details]
> VM corruption messages  in console

The attachment shows that the XFS filesystem has got corrupted, which means the
underlying disk image on the gluster volume is corrupted - in other words, data
loss for the VM.

--- Additional comment from Atin Mukherjee on 2017-01-25 03:55:52 EST ---

Raghavendra - Can you or someone from the RPC team please start looking into
this bug? Rajesh has done an initial analysis of the issue and might be the
contact point for additional data.

--- Additional comment from Milind Changire on 2017-01-27 03:39:59 EST ---

Byreddy,
Could you test with bricks on ext4 file-system and verify if this is not a
file-system related side effect ?



--- Additional comment from SATHEESARAN on 2017-01-31 05:19:27 EST ---

Snip of errors from fuse mount logs


[2017-01-31 10:05:45.576808] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f8763d16602] (-->
/lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8763adc76e] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8763adc87e] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f8763addfe4] (-->
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f8763ade874] )))))
0-imagestore-client-0: forced unwinding frame type(GlusterFS 3.3)
op(FXATTROP(34)) called at 2017-01-31 10:05:45.575497 (xid=0x5fdf)
[2017-01-31 10:05:45.576840] W [MSGID: 114031]
[client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-imagestore-client-0: remote
operation failed
[2017-01-31 10:05:45.576872] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f8763d16602] (-->
/lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8763adc76e] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8763adc87e] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f8763addfe4] (-->
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f8763ade874] )))))
0-imagestore-client-1: forced unwinding frame type(GlusterFS 3.3)
op(FXATTROP(34)) called at 2017-01-31 10:05:45.575520 (xid=0x5c7e)
[2017-01-31 10:05:45.576882] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-imagestore-client-0: disconnected from
imagestore-client-0. Client process will keep trying to connect to glusterd
until brick's port is available
[2017-01-31 10:05:45.576945] W [MSGID: 114031]
[client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-imagestore-client-1: remote
operation failed
[2017-01-31 10:05:45.576974] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-imagestore-client-1: disconnected from
imagestore-client-1. Client process will keep trying to connect to glusterd
until brick's port is available
[2017-01-31 10:05:45.576985] W [MSGID: 108001] [afr-common.c:4770:afr_notify]
0-imagestore-replicate-0: Client-quorum is not met
[2017-01-31 10:05:45.577120] E [rpc-clnt.c:365:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f8763d16602] (-->
/lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8763adc76e] (-->
/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8763adc87e] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f8763addfe4] (-->
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f8763ade874] )))))
0-imagestore-client-2: forced unwinding frame type(GlusterFS 3.3)
op(FXATTROP(34)) called at 2017-01-31 10:05:45.575541 (xid=0x5cb7)
[2017-01-31 10:05:45.577141] W [MSGID: 114031]
[client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-imagestore-client-2: remote
operation failed
[2017-01-31 10:05:45.577172] E [MSGID: 114031]
[client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-imagestore-client-0: remote
operation failed [Transport endpoint is not connected]
[2017-01-31 10:05:45.577188] E [MSGID: 114031]
[client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-imagestore-client-1: remote
operation failed [Transport endpoint is not connected]
[2017-01-31 10:05:45.577219] I [socket.c:3439:socket_submit_request]
0-imagestore-client-2: not connected (priv->connected = -1)
[2017-01-31 10:05:45.577228] W [rpc-clnt.c:1657:rpc_clnt_submit]
0-imagestore-client-2: failed to submit rpc-request (XID: 0x5cb8 Program:
GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (imagestore-client-2)
[2017-01-31 10:05:45.577240] E [MSGID: 114031]
[client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-imagestore-client-2: remote
operation failed [Transport endpoint is not connected]
[2017-01-31 10:05:45.577306] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-imagestore-client-2: disconnected from
imagestore-client-2. Client process will keep trying to connect to glusterd
until brick's port is available
[2017-01-31 10:05:45.577352] E [MSGID: 108006] [afr-common.c:4681:afr_notify]
0-imagestore-replicate-0: All subvolumes are down. Going offline until atleast
one of them comes back up.
[2017-01-31 10:05:45.620808] W [fuse-bridge.c:1291:fuse_err_cbk]
0-glusterfs-fuse: 14861: FSYNC() ERR => -1 (Input/output error)

--- Additional comment from SATHEESARAN on 2017-01-31 05:22:01 EST ---

(In reply to Milind Changire from comment #10)
> Byreddy,
> Could you test with bricks on ext4 file-system and verify if this is not a
> file-system related side effect ?

Milind, 

From all the logs, it looks like a transport issue. Referring to comment 4, I
don't think retesting with ext4 is a viable option that we have; it's only the
logs that would help in RCA'ing the issue. I also have the system intact for
further investigation.

--- Additional comment from Milind Changire on 2017-01-31 05:36:47 EST ---

(In reply to SATHEESARAN from comment #16)
> (In reply to Milind Changire from comment #10)
> > Byreddy,
> > Could you test with bricks on ext4 file-system and verify if this is not a
> > file-system related side effect ?
> 
> Milind, 
> 
> From all the logs, it looks like a transport issue. Referring to comment4 I
> don't think retesting with ext4 is a viable option that we have. With the
> logs, its only the logs that would help in RCA'ing the issue. I also have
> the system intact for further investigation.

Thanks for making time to reproduce the issue sas.
Reproducibility was the major concern.
I'll definitely need the setup for debugging.

--- Additional comment from SATHEESARAN on 2017-02-01 02:11:13 EST ---

With debugging from Raghavendra, here are the findings:

Parent down event happened at the timestamp - [2017-01-31 10:05:45.576350]

[2017-01-31 10:05:45.576350] I [MSGID: 114021] [client.c:2365:notify]
0-imagestore-client-0: current graph is no longer active, destroying rpc_client 
[2017-01-31 10:05:45.576430] I [MSGID: 114021] [client.c:2365:notify]
0-imagestore-client-1: current graph is no longer active, destroying rpc_client 
[2017-01-31 10:05:45.576474] I [MSGID: 114021] [client.c:2365:notify]
0-imagestore-client-2: current graph is no longer active, destroying rpc_client 

Fsync happens later at the timestamp - [2017-01-31 10:05:45.620808]

[2017-01-31 10:05:45.620808] W [fuse-bridge.c:1291:fuse_err_cbk]
0-glusterfs-fuse: 14861: FSYNC() ERR => -1 (Input/output error)

This means there are 2 possibilities, and both point to a bug in the graph
switch code:

1. If fsync is happening on the older graph, then the parent-down event should
not have happened.
2. If fsync happened after the graph switch, it should have happened on the new
graph.

Changing the component to fuse, as the bug is not really in RPC.

Please provide the required acks if changing the BZ component invalidates the
older ack.

--- Additional comment from SATHEESARAN on 2017-02-01 02:12:48 EST ---

Assigning the bug to Csaba (as the bug is around glusterfs-fuse), and providing
the qa_ack as the BZ already had one.

--- Additional comment from Raghavendra G on 2017-02-01 02:19:57 EST ---

https://bugzilla.redhat.com/show_bug.cgi?id=1387878 might be a similar issue


--- Additional comment from Krutika Dhananjay on 2017-02-02 01:45:18 EST ---

(In reply to SATHEESARAN from comment #18)
> With debugging from Raghavendra, here are the findings:
> 
> Parent down event happened at the timestamp - [2017-01-31 10:05:45.576350]
> 
> [2017-01-31 10:05:45.576350] I [MSGID: 114021] [client.c:2365:notify]
> 0-imagestore-client-0: current graph is no longer active, destroying
> rpc_client 
> [2017-01-31 10:05:45.576430] I [MSGID: 114021] [client.c:2365:notify]
> 0-imagestore-client-1: current graph is no longer active, destroying
> rpc_client 
> [2017-01-31 10:05:45.576474] I [MSGID: 114021] [client.c:2365:notify]
> 0-imagestore-client-2: current graph is no longer active, destroying
> rpc_client 
> 
> Fsync happens later at the timestamp - [2017-01-31 10:05:45.620808]
> 
> [2017-01-31 10:05:45.620808] W [fuse-bridge.c:1291:fuse_err_cbk]
> 0-glusterfs-fuse: 14861: FSYNC() ERR => -1 (Input/output error)
> 
> Which means there are 2 possibilities, of which both conditions looks like a
> bug with graph switch code
> 
> 1. If fsync is happening on the older graph, then parent down event should
> not have happened.
> 2. If fsync was happened later after graph switch, it should have happened
> with the new graph
> 
> Changing the component to fuse, as the bug is not really on RPC.
> 
> Please provide required acks, if changing the BZ component invalidates the
> older ack

+1. Contrary to what comment #4 says, it certainly doesn't look like an rpc bug
and resembles the graph switch issues reported by some of our community users.

What is not clear is the origin of the fop failure. It doesn't look like a
failure caused by connections on the old graph being torn down, since in that
case we would have seen 'Transport endpoint is not connected' as the cause of
the FSYNC() failure in the log, and protocol/client would have logged ENOTCONN
from client3_3_fsync_cbk().

Will check the code for possible origins of EIO (prime suspect being AFR), and
get back.

-Krutika

--- Additional comment from SATHEESARAN on 2017-02-02 23:56:55 EST ---

Update:

Raghavendra has provided a private build on top of glusterfs-3.8.4-13
(02-Feb-2017) with a few debug logs enabled.

I will reproduce the issue with this build and provide the fuse mount logs.

--- Additional comment from SATHEESARAN on 2017-02-03 07:56:07 EST ---

(In reply to SATHEESARAN from comment #24)
> Update:
> 
> Raghavendra has given the private build over glusterfs-3.8.4-13 (
> 02-Feb-2017 ) with few debug logs enabled.
> 
> I will be reproducing the issue with this build and provide the fuse mount
> logs.

Found some useful log messages in the fuse mount logs when the bricks were
added to the 1 x 3 replica volume. I will be attaching the fuse mount logs to
this bug for future reference.

--- Additional comment from SATHEESARAN on 2017-02-03 07:59:44 EST ---


Here is the snip of chat history about the issue:
<raghug> I am seeing fxattrop and finodelk on old graph even though there are
no on-going fops from fuse
<raghug> does afr do transactions for read fops?
<raghug> like readdir, readv etc
<raghug> ?
<itisravi> raghug: nope, it only does lookup/fstat if needed for refreshing the
inode context
<raghug> what are the fops it does transactions for?
<itisravi> raghug: but reads can also trigger self-heals via mount which can do
finodelk and fxattrop
<raghug> specifically finodelk and fxattrop
<itisravi> raghug: all write txns.
<raghug> itisravi, ah! that can explain
<raghug> reads could be triggered by read-ahead
<raghug> and afr could've triggered selfheal during that
<itisravi> raghug: ok, but if mount triggered afr selfheal ,you should see
"performing self-heal on gfid" or "completed self heal on gfid" etc.
<raghug> itisravi, let me check that
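
(For context on why finodelk and fxattrop turn up here at all: an AFR write
transaction, roughly speaking, locks the region, marks pending changelog
xattrs, performs the write, clears the markers and unlocks; client-triggered
self-heal uses the same primitives. The outline below only illustrates that
sequence - it is not the actual afr code and the names are made up.)

/* Rough outline of the steps in an AFR write transaction; illustrative only. */
enum afr_txn_step {
        AFR_TXN_LOCK,     /* finodelk: lock the byte range on all replicas    */
        AFR_TXN_PRE_OP,   /* fxattrop: set pending changelog xattrs per brick */
        AFR_TXN_OP,       /* the write (or other modifying fop) itself        */
        AFR_TXN_POST_OP,  /* fxattrop: clear/adjust the pending markers       */
        AFR_TXN_UNLOCK,   /* finodelk: release the lock                       */
};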

--- Additional comment from SATHEESARAN on 2017-02-06 01:19:41 EST ---

(In reply to SATHEESARAN from comment #24)
> Update:
> 
> Raghavendra has given the private build over glusterfs-3.8.4-13 (
> 02-Feb-2017 ) with few debug logs enabled.
> 
> I will be reproducing the issue with this build and provide the fuse mount
> logs.

I'm unable to hit the VM pause issue with the private build, which has logging
changes including debug messages for RCA'ing the issue.

Probably the logging changes have eliminated the occurrence of the race.

--- Additional comment from SATHEESARAN on 2017-02-07 04:44:54 EST ---

Snips of logs from fuse mount log:

<snip>

[2017-02-07 09:39:19.451836] I [fuse-bridge.c:5241:fuse_graph_setup] 0-fuse:
switched to graph 2
[2017-02-07 09:39:19.452139] I [MSGID: 114035]
[client-handshake.c:201:client_set_lk_version_cbk] 2-pluto-client-5: Server lk
version = 1
[2017-02-07 09:39:19.551607] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7f666806406c) to newfd (0x7f666227e06c)
(inode-gfid:d18f89ef-6688-4d33-8d8f-221971639488)(old-subvolume:meta-autoload-0
new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.557778] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7f66680640f8) to newfd (0x7f666227e0f8)
(inode-gfid:fce8e590-da77-4b11-b43f-31f71f5aa2ca)(old-subvolume:meta-autoload-0
new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.562129] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7f6668064184) to newfd (0x7f666227e184)
(inode-gfid:a7ae1192-9f13-4d38-8f89-f34855453089)(old-subvolume:meta-autoload-0
new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.564971] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7f666806429c) to newfd (0x7f666227e210)
(inode-gfid:73378267-5712-4a6f-b5ca-b7f277c7533b)(old-subvolume:meta-autoload-0
new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.567667] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7f6668064210) to newfd (0x7f666227e29c)
(inode-gfid:23a02bc8-b52f-4b72-9907-fbfc8464d2a1)(old-subvolume:meta-autoload-0
new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.567820] I [MSGID: 114021] [client.c:2365:notify]
0-pluto-client-0: current graph is no longer active, destroying rpc_client 
[2017-02-07 09:39:19.567873] I [MSGID: 114021] [client.c:2365:notify]
0-pluto-client-1: current graph is no longer active, destroying rpc_client 
[2017-02-07 09:39:19.567918] I [MSGID: 114021] [client.c:2365:notify]
0-pluto-client-2: current graph is no longer active, destroying rpc_client 
[2017-02-07 09:39:19.567910] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-pluto-client-0: disconnected from
pluto-client-0. Client process will keep trying to connect to glusterd until
brick's port is available
[2017-02-07 09:39:19.567989] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-pluto-client-2: disconnected from
pluto-client-2. Client process will keep trying to connect to glusterd until
brick's port is available
[2017-02-07 09:39:19.567967] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-pluto-client-1: disconnected from
pluto-client-1. Client process will keep trying to connect to glusterd until
brick's port is available
[2017-02-07 09:39:19.568060] W [MSGID: 108001] [afr-common.c:4770:afr_notify]
0-pluto-replicate-0: Client-quorum is not met
[2017-02-07 09:39:19.568192] E [MSGID: 108006] [afr-common.c:4681:afr_notify]
0-pluto-replicate-0: All subvolumes are down. Going offline until atleast one
of them comes back up.
[2017-02-07 09:39:19.610218] W [fuse-bridge.c:2228:fuse_readv_cbk]
0-glusterfs-fuse: 49362: READ => -1 gfid=a7ae1192-9f13-4d38-8f89-f34855453089
fd=0x7f666227e184 (Input/output error)
[2017-02-07 09:39:19.610517] W [fuse-bridge.c:2228:fuse_readv_cbk]
0-glusterfs-fuse: 49375: READ => -1 gfid=a7ae1192-9f13-4d38-8f89-f34855453089
fd=0x7f666227e184 (Input/output error)
[2017-02-07 09:39:19.701440] W [fuse-bridge.c:1291:fuse_err_cbk]
0-glusterfs-fuse: 49368: FSYNC() ERR => -1 (Input/output error)

</snip>



--- Additional comment from SATHEESARAN on 2017-02-08 05:31:04 EST ---

(In reply to Raghavendra G from comment #31)
> Sas,
> 
> I've triggered a new build with less debug messages at:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12509975
> 
> This will spew less logs and hence hopefully we should hit the bug. Can you
> please rerun the tests with this new build?
> 
> regards,
> Raghavendra

Thanks Raghu.

I could hit the VM pause issue with this private build; I will upload the full
logs soon. Here is a snip from the fuse mount logs:

[2017-02-08 10:25:16.086736] I [fuse-bridge.c:4846:fuse_graph_sync]
0-meta-autoload: switched to new-graph: 2 old-graph: 0.
new_subvol=0x7f83bc07d5e0 old_subvol=0x7f83cc0159c0old_subvol->winds = 2 unique
= 38239
[2017-02-08 10:25:17.205378] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7f83c9f6906c) to newfd (0x7f83c804b06c)
(inode-gfid:8dad6ee2-a57f-47b8-ac06-648931200375)(old-subvolume:meta-autoload-0
new-subvolume:meta-autoload-2)
[2017-02-08 10:25:17.779502] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7f83c9f690f8) to newfd (0x7f83c804b0f8)
(inode-gfid:d04d7083-bdfe-4424-be50-a8ce01caa8a1)(old-subvolume:meta-autoload-0
new-subvolume:meta-autoload-2)
[2017-02-08 10:25:19.965103] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7f83c9f69184) to newfd (0x7f83c804b184)
(inode-gfid:e1e3195a-1b9a-495e-a583-5bde19532dae)(old-subvolume:meta-autoload-0
new-subvolume:meta-autoload-2)
[2017-02-08 10:25:20.653075] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7f83c9f69210) to newfd (0x7f83c804b210)
(inode-gfid:093291ca-8df7-4bf2-bd77-3765c3b6de34)(old-subvolume:meta-autoload-0
new-subvolume:meta-autoload-2)
[2017-02-08 10:25:21.403567] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7f83c9f6929c) to newfd (0x7f83c804b29c)
(inode-gfid:2325ebd0-4152-4405-8e8d-f98c27105990)(old-subvolume:meta-autoload-0
new-subvolume:meta-autoload-2)
[2017-02-08 10:25:21.403650] I [fuse-bridge.c:4863:fuse_graph_sync] 0-fuse:
sending PARENT_DOWN on graph:0 subvol: 0x7f83cc0159c0 unique = 38239
[2017-02-08 10:25:21.403966] I [MSGID: 114021] [client.c:2365:notify]
0-andromeda-client-0: current graph is no longer active, destroying rpc_client 
[2017-02-08 10:25:21.404061] I [MSGID: 114021] [client.c:2365:notify]
0-andromeda-client-1: current graph is no longer active, destroying rpc_client 
[2017-02-08 10:25:21.404108] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-andromeda-client-0: disconnected from
andromeda-client-0. Client process will keep trying to connect to glusterd
until brick's port is available
[2017-02-08 10:25:21.404142] I [MSGID: 114021] [client.c:2365:notify]
0-andromeda-client-2: current graph is no longer active, destroying rpc_client 
[2017-02-08 10:25:21.404178] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-andromeda-client-1: disconnected from
andromeda-client-1. Client process will keep trying to connect to glusterd
until brick's port is available
[2017-02-08 10:25:21.404203] W [MSGID: 108001] [afr-common.c:4770:afr_notify]
0-andromeda-replicate-0: Client-quorum is not met
[2017-02-08 10:25:21.404244] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-andromeda-client-2: disconnected from
andromeda-client-2. Client process will keep trying to connect to glusterd
until brick's port is available
[2017-02-08 10:25:21.404411] E [MSGID: 108006] [afr-common.c:4681:afr_notify]
0-andromeda-replicate-0: All subvolumes are down. Going offline until atleast
one of them comes back up.
[2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38318: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1
fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38320: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1
fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.455715] W [fuse-bridge.c:767:fuse_attr_cbk]
0-glusterfs-fuse: 38290: STAT() <gfid:8dad6ee2-a57f-47b8-ac06-648931200375> =>
-1 (Input/output error)
[2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38312: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375
fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38324: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375
fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38326: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375
fd=0x7f83c804b06c (Input/output error)

--- Additional comment from Raghavendra G on 2017-02-08 07:58:37 EST ---

Looks like it's not an issue with graph switch, in the sense that fops are
_not_ sent on a graph which was notified with a PARENT_DOWN event. To
corroborate this, see the log:

[2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38318: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1
fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38320: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1
fd=0x7f83c804b0f8 (Input/output error)

If we look at the fd, it was opened on the new graph, as can be seen from the
following log:
[2017-02-08 10:25:17.779502] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7f83c9f690f8) to newfd (0x7f83c804b0f8)
(inode-gfid:d04d7083-bdfe-4424-be50-a8ce01caa8a1)(old-subvolume:meta-autoload-0
new-subvolume:meta-autoload-2)

Similar logs for other fd too:
[2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38312: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375
fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38324: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375
fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38326: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375
fd=0x7f83c804b06c (Input/output error)

which was opened on new graph too:
[2017-02-08 10:25:17.205378] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
0-glusterfs-fuse: migrated basefd (0x7f83c9f6906c) to newfd (0x7f83c804b06c)
(inode-gfid:8dad6ee2-a57f-47b8-ac06-648931200375)(old-subvolume:meta-autoload-0
new-subvolume:meta-autoload-2)

I got confused by the log, which prints the fuse xlator's name with graph id 0
(0-glusterfs-fuse), and so associated these logs with graph 0. However, it
turns out that this->graph is NULL for the fuse xlator and _gf_log uses "0" as
the id if this->graph is NULL. So the fops might as well have been sent on the
new graph, and the failures we see might be due to a different reason. The
reason could be that fops were sent on the new graph even before the new graph
is up - connections were established to brick. Note that there are log messages
indicating that an afr subvol is up in graph 2 between
1. fuse switches active graph to graph 2.
2. errors are seen on fops in fuse.
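
(For illustration only: the sketch below - not the actual libglusterfs code,
all names are simplified stand-ins - shows how a "<graph-id>-<xlator-name>" log
prefix of this kind can be derived, and why the fuse xlator always appears as
"0-glusterfs-fuse" no matter which graph actually serviced the fop.)

#include <stdio.h>

/* Simplified stand-ins for the real glusterfs types; illustrative only. */
struct graph { int id; };
typedef struct { const char *name; struct graph *graph; } xlator_t;

/* Build the "<graph-id>-<name>" prefix seen in the log lines above. The fuse
 * xlator sits above all graphs, so its graph pointer is NULL and it always
 * logs with id 0. */
static void log_prefix(const xlator_t *this, char *buf, size_t len)
{
        int graph_id = this->graph ? this->graph->id : 0;

        snprintf(buf, len, "%d-%s", graph_id, this->name);
}

int main(void)
{
        struct graph g29 = { 29 };
        xlator_t client = { "Dis-Rep1-client-9", &g29 };
        xlator_t fuse   = { "glusterfs-fuse", NULL };
        char buf[64];

        log_prefix(&client, buf, sizeof(buf));
        printf("%s\n", buf);            /* prints: 29-Dis-Rep1-client-9 */

        log_prefix(&fuse, buf, sizeof(buf));
        printf("%s\n", buf);            /* prints: 0-glusterfs-fuse */

        return 0;
}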

To prove that the failed fops were indeed sent on new graph, I've added few
more logs and the new debug build can be found at:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12511068

@Sas,

Can you please test with this new debug build? Thanks for your patience :)

regards,
Raghavendra

--- Additional comment from Raghavendra G on 2017-02-08 08:01:22 EST ---

(In reply to Raghavendra G from comment #34)
> I got confused from the log which logs fuse xlator's name with graph id 0
> (0-glusterfs-fuse) associating the logs with graph 0. But however it turns
> out that this->graph is NULL for fuse xlator and _gf_log uses "0" as id if
> this->graph is NULL. So, the fops might as well be sent on new graph and
> failures we see might be due to a different reason. The reason could be fops
> were sent on new graph even before new-graph is up - connections were
> established to brick. 

*before connections were established to bricks

> Note that there are log messages indicating that an

*Note that there are _NO_ log messages indicating that...

> afr subvol is up in graph 2 between
> 1. fuse switches active graph to graph 2.
> 2. errors are seen on fops in fuse.
> 
> To prove that the failed fops were indeed sent on new graph, I've added few
> more logs and the new debug build can be found at:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12511068
> 
> @Sas,
> 
> Can you please test with this new debug build? Thanks for your patience :)
> 
> regards,
> Raghavendra

--- Additional comment from Raghavendra G on 2017-02-08 10:24:17 EST ---

(In reply to Raghavendra G from comment #34)
> Looks like its not an issue with graph switch in the sense fops are _not_
> sent on a graph which was notified with PARENT_DOWN event. To corroborate
> this see the log:
> 
> [2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38318: WRITE => -1
> gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output
> error)
> [2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38320: WRITE => -1
> gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output
> error)
> 
> If we look at the fd, it was opened on new graph as can be seen from
> following log:
> [2017-02-08 10:25:17.779502] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
> 0-glusterfs-fuse: migrated basefd (0x7f83c9f690f8) to newfd (0x7f83c804b0f8)
> (inode-gfid:d04d7083-bdfe-4424-be50-a8ce01caa8a1)(old-subvolume:meta-
> autoload-0 new-subvolume:meta-autoload-2)
> 
> Similar logs for other fd too:
> [2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38312: WRITE => -1
> gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output
> error)
> [2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38324: WRITE => -1
> gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output
> error)
> [2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38326: WRITE => -1
> gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output
> error)
> 
> which was opened on new graph too:
> [2017-02-08 10:25:17.205378] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
> 0-glusterfs-fuse: migrated basefd (0x7f83c9f6906c) to newfd (0x7f83c804b06c)
> (inode-gfid:8dad6ee2-a57f-47b8-ac06-648931200375)(old-subvolume:meta-
> autoload-0 new-subvolume:meta-autoload-2)
> 
> I got confused from the log which logs fuse xlator's name with graph id 0
> (0-glusterfs-fuse) associating the logs with graph 0. But however it turns
> out that this->graph is NULL for fuse xlator and _gf_log uses "0" as id if
> this->graph is NULL. So, the fops might as well be sent on new graph and
> failures we see might be due to a different reason. The reason could be fops
> were sent on new graph even before new-graph is up - connections were
> established to brick. Note that there are log messages indicating that an
> afr subvol is up in graph 2 between
> 1. fuse switches active graph to graph 2.
> 2. errors are seen on fops in fuse.
> 
> To prove that the failed fops were indeed sent on new graph, I've added few
> more logs and the new debug build can be found at:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12511068

I've updated the debug build with more debug messages. Please use:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12512422

> 
> @Sas,
> 
> Can you please test with this new debug build? Thanks for your patience :)
> 
> regards,
> Raghavendra

--- Additional comment from Raghavendra G on 2017-02-09 00:59:46 EST ---

Pursuing further the RCA of fuse-bridge not waiting until the new graph is up
before directing fops to it: I do see the code where fuse_graph_sync waits on
priv->sync_cond after initializing a new graph as active_subvol. However, the
"notify" function, which broadcasts a signal on priv->sync_cond whenever it
receives a CHILD_DOWN/CHILD_UP, doesn't check on which graph the event was
received before setting priv->event_recvd to true. For example, consider this
scenario:

* fuse_graph_sync is waiting for a CHILD_UP/CHILD_DOWN on the new graph by
doing pthread_cond_wait on priv->sync_cond
* notify receives a CHILD_DOWN on the old graph and signals priv->sync_cond

In this scenario, fuse_graph_sync wakes up even though no CHILD_UP/CHILD_DOWN
was received on the new graph, and starts directing fops to the new graph; they
will keep failing until the new graph is actually up.

There is some evidence in the logs too. Note that we started seeing failed fops
immediately after the CHILD_DOWN event from afr to its parent (fuse-bridge),
and there were no errors before that.

[2017-02-08 10:25:21.404411] E [MSGID: 108006] [afr-common.c:4681:afr_notify]
0-andromeda-replicate-0: All subvolumes are down. Going offline until atleast
one of them comes back up.
[2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38318: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1
fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38320: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1
fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.455715] W [fuse-bridge.c:767:fuse_attr_cbk]
0-glusterfs-fuse: 38290: STAT() <gfid:8dad6ee2-a57f-47b8-ac06-648931200375> =>
-1 (Input/output error)
[2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38312: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375
fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38324: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375
fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk]
0-glusterfs-fuse: 38326: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375
fd=0x7f83c804b06c (Input/output error)


I'll wait for the debug logs from sas to confirm the above RCA. If the RCA
turns out to be correct, the fix would be to make fuse_graph_sync wait for a
CHILD_UP/CHILD_DOWN event on the "new graph" it just set as active-subvol,
instead of waking up on receiving CHILD_UP/CHILD_DOWN on _any_ graph.
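
(To make the race easier to follow, here is a minimal, compilable sketch of the
wait/notify interaction described above. It is not the actual fuse-bridge code,
and all names are simplified stand-ins; the direction of the proposed fix is
noted in a comment.)

#include <pthread.h>
#include <stdbool.h>

/* Simplified stand-ins for the fuse private state; illustrative only. */
typedef struct graph { int id; } graph_t;

typedef struct {
        pthread_mutex_t sync_mutex;
        pthread_cond_t  sync_cond;
        bool            event_recvd;   /* set by notify() on CHILD_UP/DOWN */
        graph_t        *next_graph;    /* graph being switched to          */
        graph_t        *active_graph;  /* graph fops are routed to         */
} fuse_private_t;

/* The waiter only checks the flag, so a CHILD_DOWN delivered for the *old*
 * graph wakes it up and fops start flowing to the new graph before any of
 * its children are connected. */
static void graph_sync(fuse_private_t *priv)
{
        pthread_mutex_lock(&priv->sync_mutex);
        priv->active_graph = priv->next_graph;
        priv->event_recvd  = false;

        while (!priv->event_recvd)
                pthread_cond_wait(&priv->sync_cond, &priv->sync_mutex);
        pthread_mutex_unlock(&priv->sync_mutex);
}

/* notify() as described in the RCA: it signals for a CHILD_UP/CHILD_DOWN
 * coming from *any* graph and never looks at which graph raised it. */
static void notify_child_event(fuse_private_t *priv, graph_t *event_graph)
{
        pthread_mutex_lock(&priv->sync_mutex);
        (void)event_graph;                     /* the graph is not consulted */
        priv->event_recvd = true;
        pthread_cond_broadcast(&priv->sync_cond);
        pthread_mutex_unlock(&priv->sync_mutex);

        /* Proposed fix per the RCA above: set event_recvd (and wake the
         * waiter) only when event_graph == priv->active_graph. */
}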

--- Additional comment from Worker Ant on 2017-02-22 04:36:21 EST ---

REVIEW: https://review.gluster.org/16709 (features/shard: Put the onus of
choosing the inode to resolve on individual fops) posted (#1) for review on
master by Krutika Dhananjay (kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2017-02-22 04:37:39 EST ---

REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read
failure due to EINVAL) posted (#4) for review on master by Krutika Dhananjay
(kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2017-02-22 05:17:28 EST ---

REVIEW: https://review.gluster.org/16709 (features/shard: Put onus of choosing
the inode to resolve on individual fops) posted (#2) for review on master by
Krutika Dhananjay (kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2017-02-22 05:31:18 EST ---

REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read
failure due to EINVAL) posted (#6) for review on master by Krutika Dhananjay
(kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2017-02-23 01:48:21 EST ---

REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read
failure due to EINVAL) posted (#7) for review on master by Pranith Kumar
Karampuri (pkarampu at redhat.com)

--- Additional comment from Worker Ant on 2017-02-23 01:48:27 EST ---

REVIEW: https://review.gluster.org/16709 (features/shard: Put onus of choosing
the inode to resolve on individual fops) posted (#3) for review on master by
Pranith Kumar Karampuri (pkarampu at redhat.com)

--- Additional comment from Worker Ant on 2017-02-23 01:53:11 EST ---

REVIEW: https://review.gluster.org/14419 (features/shard: Fix EIO error on
add-brick) posted (#8) for review on master by Krutika Dhananjay
(kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2017-02-23 01:53:17 EST ---

REVIEW: https://review.gluster.org/16709 (features/shard: Put the onus of
choosing the inode to resolve on individual fops) posted (#4) for review on
master by Krutika Dhananjay (kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2017-02-23 02:43:32 EST ---

REVIEW: https://review.gluster.org/14419 (features/shard: Fix EIO error on
add-brick) posted (#9) for review on master by Krutika Dhananjay
(kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2017-02-23 02:43:37 EST ---

REVIEW: https://review.gluster.org/16709 (features/shard: Put onus of choosing
the inode to resolve on individual fops) posted (#5) for review on master by
Krutika Dhananjay (kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2017-02-23 02:56:42 EST ---

COMMIT: https://review.gluster.org/16709 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com) 
------
commit 583e6cfc5bc73c2a79be9d42e89b90a71454596f
Author: Krutika Dhananjay <kdhananj at redhat.com>
Date:   Wed Feb 22 14:43:46 2017 +0530

    features/shard: Put onus of choosing the inode to resolve on individual
fops

    ... as opposed to adding checks in "common" functions to choose the inode
    to resolve based local->fop, which is rather ugly and prone to errors.

    Change-Id: Ia46cc59992baa2979516369cb72d8991452c0274
    BUG: 1420623
    Signed-off-by: Krutika Dhananjay <kdhananj at redhat.com>
    Reviewed-on: https://review.gluster.org/16709
    Tested-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
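
(A rough, hypothetical illustration of the refactor this patch describes - none
of the names below are the actual shard translator symbols. The idea is that
each fop records up front which inode it wants resolved, instead of one common
helper second-guessing every fop from local->fop.)

/* Hypothetical sketch of "put the onus on individual fops"; the mapping in
 * the "before" helper is arbitrary and only illustrates the shape. */
typedef enum { FOP_WRITEV, FOP_MKNOD } fop_t;
typedef struct inode inode_t;

typedef struct {
        fop_t    fop;
        inode_t *base_inode;     /* inode of the big (unsharded) file   */
        inode_t *shard_inode;    /* inode of a participant shard        */
        inode_t *resolve_inode;  /* what the common path should resolve */
} shard_local_t;

/* Before: a common helper guesses the right inode from local->fop. */
static inode_t *choose_inode_old(shard_local_t *local)
{
        if (local->fop == FOP_MKNOD)
                return local->base_inode;
        return local->shard_inode;
}

/* After: each fop sets local->resolve_inode before entering the common
 * resolution path, which no longer needs per-fop special cases. */
static inode_t *choose_inode_new(shard_local_t *local)
{
        return local->resolve_inode;
}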

--- Additional comment from Worker Ant on 2017-02-23 05:50:55 EST ---

REVIEW: https://review.gluster.org/14419 (features/shard: Fix EIO error on
add-brick) posted (#10) for review on master by Krutika Dhananjay
(kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2017-02-23 23:57:47 EST ---

COMMIT: https://review.gluster.org/14419 committed in master by Atin Mukherjee
(amukherj at redhat.com) 
------
commit 1e2773cf1586b78c71e5b8adc24c6b65f065357c
Author: Krutika Dhananjay <kdhananj at redhat.com>
Date:   Tue May 17 15:37:18 2016 +0530

    features/shard: Fix EIO error on add-brick

    DHT seems to link inode during lookup even before initializing
    inode ctx with layout information, which comes after
    directory healing.

    Consider two parallel writes. As part of the first write,
    shard sends lookup on .shard which in its return path would
    cause DHT to link .shard inode. Now at this point, when a
    second write is wound, inode_find() of .shard succeeds and
    as a result of this, shard goes to create the participant
    shards by issuing MKNODs under .shard. Since the layout is
    yet to be initialized, mknod fails in dht call path with EIO,
    leading to VM pauses.

    The fix involves shard maintaining a flag to denote whether
    a fresh lookup on .shard completed one network trip. If it
    didn't, all inode_find()s in fop path will be followed by a
    lookup before proceeding with the next stage of the fop.

    Big thanks to Raghavendra G and Pranith Kumar K for the RCA
    and subsequent inputs and feedback on the patch.

    Change-Id: I9383ec7e3f24b34cd097a1b01ad34e4eeecc621f
    BUG: 1420623
    Signed-off-by: Krutika Dhananjay <kdhananj at redhat.com>
    Reviewed-on: https://review.gluster.org/14419
    Reviewed-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Atin Mukherjee <amukherj at redhat.com>
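
(The gist of the fix above: never trust an inode_find() hit on .shard until at
least one fresh lookup has completed, so that DHT has had a chance to
initialise the layout. Below is a minimal, self-contained sketch of that rule
with hypothetical names; it is not the actual shard code.)

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real shard/DHT state; illustrative only. */
typedef struct { bool layout_initialised; } inode_t;
typedef struct { bool dot_shard_lookup_done; inode_t *dot_shard; } shard_priv_t;

/* inode_find()-style cache hit: DHT may already have linked this inode during
 * a racing lookup even though its layout is not initialised yet. */
static inode_t *find_cached_dot_shard(shard_priv_t *priv)
{
        return priv->dot_shard;
}

/* A fresh lookup completes one network trip and initialises the layout. */
static int lookup_dot_shard(shard_priv_t *priv)
{
        static inode_t dot_shard_inode;

        priv->dot_shard = &dot_shard_inode;
        priv->dot_shard->layout_initialised = true;
        return 0;
}

static int mknod_shards_under(inode_t *dot_shard)
{
        return dot_shard->layout_initialised ? 0 : -5;  /* EIO without layout */
}

/* The rule the fix introduces: if no fresh lookup on .shard has completed
 * yet, follow the inode_find() hit with a lookup before creating the
 * participant shards. */
int shard_resume_mknod(shard_priv_t *priv)
{
        inode_t *dot_shard = find_cached_dot_shard(priv);

        if (dot_shard == NULL || !priv->dot_shard_lookup_done) {
                if (lookup_dot_shard(priv) != 0)
                        return -1;
                priv->dot_shard_lookup_done = true;
                dot_shard = priv->dot_shard;
        }
        return mknod_shards_under(dot_shard);
}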


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1387878
[Bug 1387878] Rebalance after add bricks corrupts files