[Gluster-users] KVM lockups on Gluster 4.1.1

Claus Jeppesen cjeppesen at datto.com
Tue Aug 21 06:50:32 UTC 2018


Hi Amar,

Unfortunately I no longer have the GlusterFS brick logs. However, I do
have a hint:
I have 2 GlusterFS (4.1.1) volumes where I saw the issue, each with
about 10-12 active VMs.
I also have 2 additional GlusterFS (4.1.1) volumes, but with only 3-4
VMs, where I did not see the
issue (and they had been running for 1-2 months).

Thanks,

Claus.

P.S. I assume we are talking about using a Gluster "URI" with qemu, e.g.:

   <disk type='network' device='disk'>
     <driver name='qemu' type='raw' cache='none' io='native'/>
     <source protocol='gluster' name='glu-vol03-lab/install3'>
       <host name='install2.vlan13' port='24007'/>
     </source>
     <target dev='vda' bus='virtio'/>
   </disk>
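
For reference, here is a minimal sketch (Python, standard library only) of how
the disk definition above maps onto the gluster://host:port/volume/image URI
form that qemu accepts. The helper name gluster_uri_from_disk_xml is
hypothetical, and the sketch assumes a single <host> element and the default
TCP transport; it is only meant to show which form of access I mean.

```python
import xml.etree.ElementTree as ET

# Disk definition copied from the libvirt domain XML above.
DISK_XML = """
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source protocol='gluster' name='glu-vol03-lab/install3'>
    <host name='install2.vlan13' port='24007'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
"""


def gluster_uri_from_disk_xml(xml_text):
    """Build a gluster:// URI from a libvirt network-disk element.

    Hypothetical helper: assumes one <host> child and TCP transport.
    """
    disk = ET.fromstring(xml_text)
    source = disk.find("source")
    if source is None or source.get("protocol") != "gluster":
        raise ValueError("not a gluster network disk")
    host = source.find("host")
    if host is None:
        raise ValueError("no <host> element in <source>")
    name = host.get("name")
    port = host.get("port")
    # The port is optional in the URI; include it only when the XML has one.
    authority = f"{name}:{port}" if port else name
    # The source name attribute is "<volume>/<image path>".
    return f"gluster://{authority}/{source.get('name')}"


print(gluster_uri_from_disk_xml(DISK_XML))
```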




On Mon, Aug 20, 2018 at 5:39 PM Amar Tumballi <atumball at redhat.com> wrote:

>
>
> On Mon, Aug 20, 2018 at 6:20 PM, Walter Deignan <WDeignan at uline.com>
> wrote:
>
>> I upgraded late last week to 4.1.2. Since then I've seen several posix
>> health checks fail and bricks drop offline but I'm not sure if that's
>> related or a different root issue.
>>
>> I haven't seen the issue described below re-occur on 4.1.2 yet but it was
>> intermittent to begin with so I'll probably need to run for a week or more
>> to be confident.
>>
>>
> Thanks for the update! We will try to reproduce the issue and also find
> the root cause through code analysis, but if you can get us brick logs from
> around the time this happens, it may fast-track the fix.
>
> Thanks again,
> Amar
>
>
>> -Walter Deignan
>> -Uline IT, Systems Architect
>>
>>
>>
>> From:        "Claus Jeppesen" <cjeppesen at datto.com>
>> To:        WDeignan at uline.com
>> Cc:        gluster-users at gluster.org
>> Date:        08/20/2018 07:20 AM
>> Subject:        Re: [Gluster-users] KVM lockups on Gluster 4.1.1
>> ------------------------------
>>
>>
>>
>> I think I have seen this also on our CentOS 7.5 systems using GlusterFS
>> 4.1.1 (*) - has an upgrade to 4.1.2 helped? I'm trying this now.
>>
>> Thanks,
>>
>> Claus.
>>
>> (*)  libvirt/qemu log:
>> [2018-08-19 16:45:54.275830] E [MSGID: 114031]
>> [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
>> 0-glu-vol01-lab-client-0: remote operation failed [Invalid argument]
>> [2018-08-19 16:45:54.276156] E [MSGID: 114031]
>> [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
>> 0-glu-vol01-lab-client-1: remote operation failed [Invalid argument]
>> [2018-08-19 16:45:54.276159] E [MSGID: 108010]
>> [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 0-glu-vol01-lab-replicate-0:
>> path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on
>> subvolume glu-vol01-lab-client-0 with lock owner 28ae497049560000
>> [Invalid argument]
>> [2018-08-19 16:45:54.276183] E [MSGID: 108010]
>> [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 0-glu-vol01-lab-replicate-0:
>> path=(null) gfid=00000000-0000-0000-0000-000000000000: unlock failed on
>> subvolume glu-vol01-lab-client-1 with lock owner 28ae497049560000
>> [Invalid argument]
>> [2018-08-19 17:16:03.690808] E [rpc-clnt.c:184:call_bail]
>> 0-glu-vol01-lab-client-0: bailing out frame type(GlusterFS 4.x v1)
>> op(FINODELK(30)) xid = 0x3071a5 sent = 2018-08-19 16:45:54.276560. timeout
>> = 1800 for 192.168.13.131:49152
>> [2018-08-19 17:16:03.691113] E [MSGID: 114031]
>> [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
>> 0-glu-vol01-lab-client-0: remote operation failed [Transport endpoint is
>> not connected]
>> [2018-08-19 17:46:03.855909] E [rpc-clnt.c:184:call_bail]
>> 0-glu-vol01-lab-client-1: bailing out frame type(GlusterFS 4.x v1)
>> op(FINODELK(30)) xid = 0x301d0f sent = 2018-08-19 17:16:03.691174. timeout
>> = 1800 for 192.168.13.132:49152
>> [2018-08-19 17:46:03.856170] E [MSGID: 114031]
>> [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
>> 0-glu-vol01-lab-client-1: remote operation failed [Transport endpoint is
>> not connected]
>> block I/O error in device 'drive-virtio-disk0': Operation not permitted
>> (1)
>> ... many repeats ...
>> block I/O error in device 'drive-virtio-disk0': Operation not permitted
>> (1)
>> [2018-08-19 18:16:04.022526] E [rpc-clnt.c:184:call_bail]
>> 0-glu-vol01-lab-client-0: bailing out frame type(GlusterFS 4.x v1)
>> op(FINODELK(30)) xid = 0x307221 sent = 2018-08-19 17:46:03.861005. timeout
>> = 1800 for 192.168.13.131:49152
>> [2018-08-19 18:16:04.022788] E [MSGID: 114031]
>> [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
>> 0-glu-vol01-lab-client-0: remote operation failed [Transport endpoint is
>> not connected]
>> [2018-08-19 18:46:04.195590] E [rpc-clnt.c:184:call_bail]
>> 0-glu-vol01-lab-client-1: bailing out frame type(GlusterFS 4.x v1)
>> op(FINODELK(30)) xid = 0x301d8a sent = 2018-08-19 18:16:04.022838. timeout
>> = 1800 for 192.168.13.132:49152
>> [2018-08-19 18:46:04.195881] E [MSGID: 114031]
>> [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk]
>> 0-glu-vol01-lab-client-1: remote operation failed [Transport endpoint is
>> not connected]
>> block I/O error in device 'drive-virtio-disk0': Operation not permitted
>> (1)
>> block I/O error in device 'drive-virtio-disk0': Operation not permitted
>> (1)
>> block I/O error in device 'drive-virtio-disk0': Operation not permitted
>> (1)
>> block I/O error in device 'drive-virtio-disk0': Operation not permitted
>> (1)
>> block I/O error in device 'drive-virtio-disk0': Operation not permitted
>> (1)
>> qemu: terminating on signal 15 from pid 507
>> 2018-08-19 19:36:59.065+0000: shutting down, reason=destroyed
>> 2018-08-19 19:37:08.059+0000: starting up libvirt version: 3.9.0,
>> package: 14.el7_5.6 (CentOS BuildSystem <http://bugs.centos.org>,
>> 2018-06-27-14:13:57, x86-01.bsys.centos.org), qemu version: 1.5.3
>> (qemu-kvm-1.5.3-156.el7_5.3)
>>
>> At 19:37 the VM was restarted.
>>
>>
>>
>> On Wed, Aug 15, 2018 at 8:25 PM Walter Deignan <WDeignan at uline.com>
>> wrote:
>> I am using gluster to host KVM/QEMU images. I am seeing an intermittent
>> issue where access to an image will hang. I have to do a lazy unmount of
>> the gluster volume in order to break the lock and then reset the impacted
>> virtual machine.
>>
>> It happened again today and I caught the events below in the client side
>> logs. Any thoughts on what might cause this? It seemed to begin after I
>> upgraded from 3.12.10 to 4.1.1 a few weeks ago.
>>
>> [2018-08-14 14:22:15.549501] E [MSGID: 114031]
>> [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-4: remote
>> operation failed [Invalid argument]
>> [2018-08-14 14:22:15.549576] E [MSGID: 114031]
>> [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-5: remote
>> operation failed [Invalid argument]
>> [2018-08-14 14:22:15.549583] E [MSGID: 108010]
>> [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 2-gv1-replicate-2: path=(null)
>> gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume
>> gv1-client-4 with lock owner d89caca92b7f0000 [Invalid argument]
>> [2018-08-14 14:22:15.549615] E [MSGID: 108010]
>> [afr-lk-common.c:284:afr_unlock_inodelk_cbk] 2-gv1-replicate-2: path=(null)
>> gfid=00000000-0000-0000-0000-000000000000: unlock failed on subvolume
>> gv1-client-5 with lock owner d89caca92b7f0000 [Invalid argument]
>> [2018-08-14 14:52:18.726219] E [rpc-clnt.c:184:call_bail] 2-gv1-client-4:
>> bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc5e00
>> sent = 2018-08-14 14:22:15.699082. timeout = 1800 for
>> 10.35.20.106:49159
>> [2018-08-14 14:52:18.726254] E [MSGID: 114031]
>> [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-4: remote
>> operation failed [Transport endpoint is not connected]
>> [2018-08-14 15:22:25.962546] E [rpc-clnt.c:184:call_bail] 2-gv1-client-5:
>> bailing out frame type(GlusterFS 4.x v1) op(FINODELK(30)) xid = 0xc4a6d
>> sent = 2018-08-14 14:52:18.726329. timeout = 1800 for
>> 10.35.20.107:49164
>> [2018-08-14 15:22:25.962587] E [MSGID: 114031]
>> [client-rpc-fops_v2.c:1352:client4_0_finodelk_cbk] 2-gv1-client-5: remote
>> operation failed [Transport endpoint is not connected]
>> [2018-08-14 15:22:25.962618] W [MSGID: 108019]
>> [afr-lk-common.c:601:is_blocking_locks_count_sufficient] 2-gv1-replicate-2:
>> Unable to obtain blocking inode lock on even one child for
>> gfid:24a48cae-53fe-4634-8fb7-0254c85ad672.
>> [2018-08-14 15:22:25.962668] W [fuse-bridge.c:1441:fuse_err_cbk]
>> 0-glusterfs-fuse: 3715808: FSYNC() ERR => -1 (Transport endpoint is not
>> connected)
>>
>> Volume configuration -
>>
>> Volume Name: gv1
>> Type: Distributed-Replicate
>> Volume ID: 66ad703e-3bae-4e79-a0b7-29ea38e8fcfc
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 5 x 2 = 10
>> Transport-type: tcp
>> Bricks:
>> Brick1: dc-vihi44:/gluster/bricks/megabrick/data
>> Brick2: dc-vihi45:/gluster/bricks/megabrick/data
>> Brick3: dc-vihi44:/gluster/bricks/brick1/data
>> Brick4: dc-vihi45:/gluster/bricks/brick1/data
>> Brick5: dc-vihi44:/gluster/bricks/brick2_1/data
>> Brick6: dc-vihi45:/gluster/bricks/brick2/data
>> Brick7: dc-vihi44:/gluster/bricks/brick3/data
>> Brick8: dc-vihi45:/gluster/bricks/brick3/data
>> Brick9: dc-vihi44:/gluster/bricks/brick4/data
>> Brick10: dc-vihi45:/gluster/bricks/brick4/data
>> Options Reconfigured:
>> cluster.min-free-inodes: 6%
>> performance.client-io-threads: off
>> nfs.disable: on
>> transport.address-family: inet
>> performance.quick-read: off
>> performance.read-ahead: off
>> performance.io-cache: off
>> performance.low-prio-threads: 32
>> network.remote-dio: enable
>> cluster.eager-lock: enable
>> cluster.server-quorum-type: server
>> cluster.data-self-heal-algorithm: full
>> cluster.locking-scheme: granular
>> cluster.shd-max-threads: 8
>> cluster.shd-wait-qlength: 10000
>> user.cifs: off
>> cluster.choose-local: off
>> features.shard: on
>> cluster.server-quorum-ratio: 51%
>>
>> -Walter Deignan
>> -Uline IT, Systems Architect
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>>
>> --
>> Claus Jeppesen
>> Manager, Network Services
>> Datto, Inc.
>> p +45 6170 5901 | Copenhagen Office
>> www.datto.com
>>
>>
>>
>
>
>
> --
> Amar Tumballi (amarts)
>


-- 
Claus Jeppesen
Manager, Network Services
Datto, Inc.
p +45 6170 5901 | Copenhagen Office
www.datto.com