[Gluster-users] KVM lockups on Gluster 4.1.1

Amar Tumballi atumball at redhat.com
Mon Aug 20 15:38:30 UTC 2018


On Mon, Aug 20, 2018 at 6:20 PM, Walter Deignan <WDeignan at uline.com> wrote:

> I upgraded late last week to 4.1.2. Since then I've seen several posix
> health checks fail and bricks drop offline but I'm not sure if that's
> related or a different root issue.
>
> I haven't seen the issue described below re-occur on 4.1.2 yet but it was
> intermittent to begin with so I'll probably need to run for a week or more
> to be confident.
>
>
Thanks for the update! We will be trying to reproduce the issue and also to
root-cause it through code analysis, but if you can get us the brick logs from
around the time this happens, it may fast-track the investigation.
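
For reference, the brick logs live on each server under /var/log/glusterfs/bricks/
by default, one file per brick, named after the brick path with slashes turned into
dashes. A minimal way to pull the relevant window (the timestamp pattern and paths
below are placeholders; adjust them to your bricks and to the time of the hang):

    # on each Gluster server
    ls /var/log/glusterfs/bricks/
    # e.g. gluster-bricks-brick1-data.log for the brick /gluster/bricks/brick1/data
    grep '2018-08-19 1[67]:' /var/log/glusterfs/bricks/*.log > brick-logs-incident.txt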

Thanks again,
Amar


> -Walter Deignan
> -Uline IT, Systems Architect
>
>
>
> From:        "Claus Jeppesen" <cjeppesen at datto.com>
> To:        WDeignan at uline.com
> Cc:        gluster-users at gluster.org
> Date:        08/20/2018 07:20 AM
> Subject:        Re: [Gluster-users] KVM lockups on Gluster 4.1.1
> ------------------------------
>
>
>
> I think I have seen this also on our CentOS 7.5 systems using GlusterFS
> 4.1.1 (*) - has an upgrade to 4.1.2 helped out? I'm trying this now.
>
> Thanx,
>
> Claus.
>
> (*) libvirt/qemu log:
> [2018-08-19 16:45:54.275830] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
> 0-glu-vol01-lab-client-0: remote operation failed [Invalid argument]
> [2018-08-19 16:45:54.276156] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
> 0-glu-vol01-lab-client-1: remote operation failed [Invalid argument]
> [2018-08-19 16:45:54.276159] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk]
> 0-glu-vol01-lab-replicate-0: path=(null) gfid=00000000-0000-0000-0000-000000000000:
> unlock failed on subvolume glu-vol01-lab-client-0 with lock owner 28ae497049560000 [Invalid argument]
> [2018-08-19 16:45:54.276183] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk]
> 0-glu-vol01-lab-replicate-0: path=(null) gfid=00000000-0000-0000-0000-000000000000:
> unlock failed on subvolume glu-vol01-lab-client-1 with lock owner 28ae497049560000 [Invalid argument]
> [2018-08-19 17:16:03.690808] E [rpc-clnt.c:184:call_bail]
> 0-glu-vol01-lab-client-0: bailing out frame type(GlusterFS 4.x v1)
> op(FINODELK(30)) xid = 0x3071a5 sent = 2018-08-19 16:45:54.276560. timeout
> = 1800 for 192.168.13.131:49152
> [2018-08-19 17:16:03.691113] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
> 0-glu-vol01-lab-client-0: remote operation failed [Transport endpoint is
> not connected]
> [2018-08-19 17:46:03.855909] E [rpc-clnt.c:184:call_bail]
> 0-glu-vol01-lab-client-1: bailing out frame type(GlusterFS 4.x v1)
> op(FINODELK(30)) xid = 0x301d0f sent = 2018-08-19 17:16:03.691174. timeout
> = 1800 for 192.168.13.132:49152
> [2018-08-19 17:46:03.856170] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
> 0-glu-vol01-lab-client-1: remote operation failed [Transport endpoint is
> not connected]
> block I/O error in device 'drive-virtio-disk0': Operation not permitted
> (1)
> ... many repeats ...
> block I/O error in device 'drive-virtio-disk0': Operation not permitted
> (1)
> [2018-08-19 18:16:04.022526] E [rpc-clnt.c:184:call_bail]
> 0-glu-vol01-lab-client-0: bailing out frame type(GlusterFS 4.x v1)
> op(FINODELK(30)) xid = 0x307221 sent = 2018-08-19 17:46:03.861005. timeout
> = 1800 for 192.168.13.131:49152
> [2018-08-19 18:16:04.022788] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
> 0-glu-vol01-lab-client-0: remote operation failed [Transport endpoint is
> not connected]
> [2018-08-19 18:46:04.195590] E [rpc-clnt.c:184:call_bail]
> 0-glu-vol01-lab-client-1: bailing out frame type(GlusterFS 4.x v1)
> op(FINODELK(30)) xid = 0x301d8a sent = 2018-08-19 18:16:04.022838. timeout
> = 1800 for 192.168.13.132:49152
> [2018-08-19 18:46:04.195881] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
> 0-glu-vol01-lab-client-1: remote operation failed [Transport endpoint is
> not connected]
> block I/O error in device 'drive-virtio-disk0': Operation not permitted
> (1)
> block I/O error in device 'drive-virtio-disk0': Operation not permitted
> (1)
> block I/O error in device 'drive-virtio-disk0': Operation not permitted
> (1)
> block I/O error in device 'drive-virtio-disk0': Operation not permitted
> (1)
> block I/O error in device 'drive-virtio-disk0': Operation not permitted
> (1)
> qemu: terminating on signal 15 from pid 507
> 2018-08-19 19:36:59.065+0000: shutting down, reason=destroyed
> 2018-08-19 19:37:08.059+0000: starting up libvirt version: 3.9.0, package:
> 14.el7_5.6 (CentOS BuildSystem <http://bugs.centos.org>, 2018-06-27-14:13:57,
> x86-01.bsys.centos.org), qemu version: 1.5.3 (qemu-kvm-1.5.3-156.el7_5.3)
>
> At 19:37 the VM was restarted.
>
>
>
> On Wed, Aug 15, 2018 at 8:25 PM Walter Deignan <WDeignan at uline.com> wrote:
> I am using gluster to host KVM/QEMU images. I am seeing an intermittent
> issue where access to an image will hang. I have to do a lazy dismount of
> the gluster volume in order to break the lock and then reset the impacted
> virtual machine.
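
(For anyone hitting the same hang, the lazy dismount plus remount described just
above would look roughly like the following; the mount point is a placeholder, and
the server/volume names are taken from the volume info further down.)

    # detach the stuck FUSE mount without waiting for the hung I/O
    umount -l /var/lib/libvirt/images   # placeholder mount point
    # remount the volume, then reset the affected VM
    mount -t glusterfs dc-vihi44:/gv1 /var/lib/libvirt/images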
>
> It happened again today and I caught the events below in the client side
> logs. Any thoughts on what might cause this? It seemed to begin after I
> upgraded from 3.12.10 to 4.1.1 a few weeks ago.
>
> [2018-08-14 14:22:15.549501] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
> 2-gv1-client-4: remote operation failed [Invalid argument]
> [2018-08-14 14:22:15.549576] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
> 2-gv1-client-5: remote operation failed [Invalid argument]
> [2018-08-14 14:22:15.549583] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk]
> 2-gv1-replicate-2: path=(null) gfid=00000000-0000-0000-0000-000000000000:
> unlock failed on subvolume gv1-client-4 with lock owner d89caca92b7f0000
> [Invalid argument]
> [2018-08-14 14:22:15.549615] E [MSGID: 108010] [afr-lk-common.c:284:afr_unlock_inodelk_cbk]
> 2-gv1-replicate-2: path=(null) gfid=00000000-0000-0000-0000-000000000000:
> unlock failed on subvolume gv1-client-5 with lock owner d89caca92b7f0000
> [Invalid argument]
> [2018-08-14 14:52:18.726219] E [rpc-clnt.c:184:call_bail] 2-gv1-client-4:
> bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc5e00
> sent = 2018-08-14 14:22:15.699082. timeout = 1800 for 10.35.20.106:49159
> [2018-08-14 14:52:18.726254] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
> 2-gv1-client-4: remote operation failed [Transport endpoint is not
> connected]
> [2018-08-14 15:22:25.962546] E [rpc-clnt.c:184:call_bail] 2-gv1-client-5:
> bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc4a6d
> sent = 2018-08-14 14:52:18.726329. timeout = 1800 for 10.35.20.107:49164
> [2018-08-14 15:22:25.962587] E [MSGID: 114031] [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
> 2-gv1-client-5: remote operation failed [Transport endpoint is not
> connected]
> [2018-08-14 15:22:25.962618] W [MSGID: 108019] [afr-lk-common.c:601:is_blocking_locks_count_sufficient]
> 2-gv1-replicate-2: Unable to obtain blocking inode lock on even one child
> for gfid:24a48cae-53fe-4634-8fb7-0254c85ad672.
> [2018-08-14 15:22:25.962668] W [fuse-bridge.c:1441:fuse_err_cbk]
> 0-glusterfs-fuse: 3715808: FSYNC() ERR => -1 (Transport endpoint is not
> connected)
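
(The "timeout = 1800" in the call_bail lines above is the RPC frame timeout,
i.e. the network.frame-timeout option, which defaults to 1800 seconds; after
waiting that long for the FINODELK reply the client bails out the frame. The
effective value on the volume can be checked with the stock CLI:)

    gluster volume get gv1 network.frame-timeout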
>
> Volume configuration -
>
> Volume Name: gv1
> Type: Distributed-Replicate
> Volume ID: 66ad703e-3bae-4e79-a0b7-29ea38e8fcfc
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 5 x 2 = 10
> Transport-type: tcp
> Bricks:
> Brick1: dc-vihi44:/gluster/bricks/megabrick/data
> Brick2: dc-vihi45:/gluster/bricks/megabrick/data
> Brick3: dc-vihi44:/gluster/bricks/brick1/data
> Brick4: dc-vihi45:/gluster/bricks/brick1/data
> Brick5: dc-vihi44:/gluster/bricks/brick2_1/data
> Brick6: dc-vihi45:/gluster/bricks/brick2/data
> Brick7: dc-vihi44:/gluster/bricks/brick3/data
> Brick8: dc-vihi45:/gluster/bricks/brick3/data
> Brick9: dc-vihi44:/gluster/bricks/brick4/data
> Brick10: dc-vihi45:/gluster/bricks/brick4/data
> Options Reconfigured:
> cluster.min-free-inodes: 6%
> performance.client-io-threads: off
> nfs.disable: on
> transport.address-family: inet
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.low-prio-threads: 32
> network.remote-dio: enable
> cluster.eager-lock: enable
> cluster.server-quorum-type: server
> cluster.data-self-heal-algorithm: full
> cluster.locking-scheme: granular
> cluster.shd-max-threads: 8
> cluster.shd-wait-qlength: 10000
> user.cifs: off
> cluster.choose-local: off
> features.shard: on
> cluster.server-quorum-ratio: 51%
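
(Given the bricks dropping offline mentioned at the top of the thread, it may also
be worth cross-checking which brick owns the port the client is bailing out on,
e.g. 10.35.20.106:49159 above, and whether that brick is still online; a quick
check with the stock CLI:)

    gluster volume status gv1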
>
> -Walter Deignan
> -Uline IT, Systems Architect
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
>
> --
> Claus Jeppesen
> Manager, Network Services
> Datto, Inc.
> p +45 6170 5901 | Copenhagen Office
> www.datto.com
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>



-- 
Amar Tumballi (amarts)