[Bugs] [Bug 1775522] New: Qemu VM issues IO errors with gluster native driver & SSL enabled

bugzilla at redhat.com bugzilla at redhat.com
Fri Nov 22 07:40:04 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1775522

            Bug ID: 1775522
           Summary: Qemu VM issues IO errors with gluster native driver &
                    SSL enabled
           Product: GlusterFS
           Version: 5
            Status: NEW
         Component: qemu-block
          Assignee: bugs at gluster.org
          Reporter: g.fhnrunznrqeqf at noclue.notk.org
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Description of problem:
qemu VMs sometimes get IO errors and mark their disks read-only in the guest.
A simple reset within the guest resolves it, so it is a transient error that
the gluster API (gfapi) recovers from (a plain remount,rw inside the guest is
not enough though), but we cannot run VMs that go read-only once in a while :)

Version-Release number of selected component (if applicable):
Using pve 6 (debian buster (10)) versions:
 - gluster 5.5-3
 - pve-qemu-kvm 4.0.0-3

How reproducible:
Has happened about once a week on some of our VMs (a small test cluster with 6
VMs right now) over the past two weeks.
I could not reproduce it manually by running a heavy IO load, so I think it is
unrelated to activity. The affected VMs weren't fully idle either (there are
syslog messages just before the problem).
I could not reproduce it by killing gluster daemons either.

I could, however, reproduce with tcpkill as described below.

The fuse `glusterfs` mount does not exhibit the problem.
I did not test whether disabling SSL fixes the problem, but that would
probably have been noticed before if it did not ;)
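For reference, an equivalent check through the fuse mount can be done along
these lines (host, volume name, and mount point are placeholders, not the
actual setup from this report):

```shell
# Mount the same volume over fuse instead of the gfapi/native driver,
# then repeat the same synchronous write while running the tcpkill test.
mount -t glusterfs host1:/data /mnt/data-fuse
echo foo | dd of=/mnt/data-fuse/testfile conv=fsync
```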


Steps to Reproduce:
1. Set up a gluster volume with SSL enabled and replica 2, and configure a
VM to use it via gluster://<volume>/disk
2. Identify the port used by qemu, e.g. with `ss -npt | grep <qemu pid>`, and
run `tcpkill -i <iface> host <qemu host> and port <qemu port>` -- it is wise
to check that nothing else uses that ip/port combination wherever you run
tcpkill (possible if the port is reused as a source port somewhere else)
3. Do some IO in the VM, e.g. `echo foo | dd of=fileongluster conv=fsync`
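A consolidated sketch of the steps above (volume name, hosts, brick paths, and
interface are placeholders; SSL certificate setup on the nodes is assumed to
be in place already):

```shell
# 1. Create a replica-2 volume and enable SSL on the IO path.
gluster volume create data replica 2 host1:/bricks/data host2:/bricks/data
gluster volume set data client.ssl on
gluster volume set data server.ssl on
gluster volume start data

# Point the VM disk at the volume via the native driver, e.g. in qemu:
#   -drive file=gluster://host1/data/disk.qcow2,format=qcow2

# 2. On the hypervisor, find qemu's connection to the brick and kill it.
ss -npt | grep "$(pidof qemu-system-x86_64)"
tcpkill -i eth0 host <qemu host> and port <qemu port>

# 3. In the guest, force a synchronous write to trigger the IO error.
echo foo | dd of=fileongluster conv=fsync
```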

Actual results:
The VM gets an IO error.

Expected results:
The VM keeps writing to the brick that still works, and the file goes through
the normal healing mechanisms when the other brick reconnects (or whatever
happens without SSL).


Additional info:
The only logs I could find were about the connection dropping on a single
brick; there was nothing in the qemu logs or on the other brick.
Here is what it looks like when the problem actually happened (it is different
when tcpkill resets the connection, as that looks like a clean connection
close, but I am fairly confident the underlying problem is the same):

[2019-11-17 08:25:17.124430] E [socket.c:246:ssl_dump_error_stack]
0-tcp.data-server:   error:1408F09C:SSL routines:ssl3_get_record:http request
[2019-11-17 08:25:17.124485] W [socket.c:1386:__socket_read_simple_msg]
0-tcp.data-server: reading from socket failed. Error (Input/output error), peer
(10.0.2.XY:49117)
[2019-11-17 08:25:17.124511] I [socket.c:3712:socket_submit_outgoing_msg]
0-tcp.data-server: not connected (priv->connected = -1)
[2019-11-17 08:25:17.124516] I [MSGID: 115036] [server.c:469:server_rpc_notify]
0-data-server: disconnecting connection from
CTX_ID:84b07bad-ef0d-4532-91a5-82edc7657c17-GRAPH_ID:23-PID:929457-HOST:hyperv1-PC_NAME:data-client-0-RECON_NO:-0
[2019-11-17 08:25:17.124545] E [rpcsvc.c:1381:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0xf534, Program: GlusterFS 4.x
v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.124576] W [inodelk.c:609:pl_inodelk_log_cleanup]
0-data-server: releasing lock on 2b156905-db61-4c11-b1c8-8356d7ebd554 held by
{client=0x7f5198039a30, pid=929457 lk-owner=2847a09f5c7f0000}
[2019-11-17 08:25:17.124607] I [MSGID: 115013]
[server-helpers.c:283:do_fd_cleanup] 0-data-server: fd cleanup on
/images/169076/vm-169076-disk-1.qcow2
[2019-11-17 08:25:17.124621] E [server.c:133:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913)
[0x7f521e948913]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108)
[0x7f521e8e8108]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3)
[0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.124717] E [rpcsvc.c:1381:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0xf535, Program: GlusterFS 4.x
v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.124797] E [server.c:133:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913)
[0x7f521e948913]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108)
[0x7f521e8e8108]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3)
[0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125129] E [rpcsvc.c:1381:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0xf536, Program: GlusterFS 4.x
v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.125181] E [server.c:133:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913)
[0x7f521e948913]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108)
[0x7f521e8e8108]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3)
[0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125360] E [rpcsvc.c:1381:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0xf537, Program: GlusterFS 4.x
v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.125409] E [server.c:133:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913)
[0x7f521e948913]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108)
[0x7f521e8e8108]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3)
[0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125542] E [rpcsvc.c:1381:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0xf538, Program: GlusterFS 4.x
v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.125585] E [server.c:133:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913)
[0x7f521e948913]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108)
[0x7f521e8e8108]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3)
[0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125761] E [rpcsvc.c:1381:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0xf539, Program: GlusterFS 4.x
v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.125876] E [server.c:133:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913)
[0x7f521e948913]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108)
[0x7f521e8e8108]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3)
[0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.125960] E [rpcsvc.c:1381:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0xf53a, Program: GlusterFS 4.x
v1, ProgVers: 400, Proc: 13) to rpc-transport (tcp.data-server)
[2019-11-17 08:25:17.126011] E [server.c:133:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/debug/io-stats.so(+0x16913)
[0x7f521e948913]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0x50108)
[0x7f521e8e8108]
-->/usr/lib/x86_64-linux-gnu/glusterfs/5.5/xlator/protocol/server.so(+0xb7c3)
[0x7f521e8a37c3] ) 0-: Reply submission failed
[2019-11-17 08:25:17.126039] I [MSGID: 101055] [client_t.c:435:gf_client_unref]
0-data-server: Shutting down connection
CTX_ID:84b07bad-ef0d-4532-91a5-82edc7657c17-GRAPH_ID:23-PID:929457-HOST:hyperv1-PC_NAME:data-client-0-RECON_NO:-0
[2019-11-17 08:25:27.618791] I [addr.c:54:compare_addr_and_update]
0-/mnt/data/gluster-data: allowed = "*", received addr = "10.0.2.XY"
[2019-11-17 08:25:27.618827] I [login.c:46:gf_auth] 0-auth/login: connecting
user name: hyperv1.fdn.fr
[2019-11-17 08:25:27.618835] I [login.c:110:gf_auth] 0-auth/login: allowed user
names: hyperv2.fdn.fr,hyperv1.fdn.fr
[2019-11-17 08:25:27.618843] I [MSGID: 115029]
[server-handshake.c:537:server_setvolume] 0-data-server: accepted client from
CTX_ID:84b07bad-ef0d-4532-91a5-82edc7657c17-GRAPH_ID:23-PID:929457-HOST:hyperv1-PC_NAME:data-client-0-RECON_NO:-1
(version: 5.5)


I will have a look at the code and try to pinpoint something more precise,
either tonight or next week, if the cause has not been identified yet -- it
looks fairly straightforward. I would also like to check whether this has
already been fixed, as I do not have anywhere to test whether the problem is
still present on the master branch.

Thanks!

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.