[Gluster-users] GlusterFS mount crash

Tue Nov 22 08:41:17 UTC 2022

Hi Xavi,

The OS is Debian 11 with the Proxmox kernel. The Gluster packages are the
official ones from gluster.org
(https://download.gluster.org/pub/gluster/glusterfs/10/10.3/Debian/bullseye/).

The system logs showed no other issues around the time of the crash (no OOM
kill or anything similar), and no process other than Proxmox was interacting
with the gluster mount point.

I wasn't running gdb when it crashed, so I don't know whether I can obtain
a more detailed trace from the logs, or whether there is a simple way to
leave it running in the background to see if it happens again (or a flag to
start the systemd daemon in debug mode).
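Even without gdb, the module-relative frames in the crash log (the lines of the form library(+0xOFFSET)[0xADDRESS]) can usually be resolved offline with binutils' addr2line, provided debug symbols for exactly the same 10.3 build are installed. Whether gluster.org publishes such debug packages for its bullseye repository is an assumption to check. The sketch below only extracts the offsets and prints one addr2line command per library; frames printed as symbol+offset (such as gf_print_trace+0x700) are skipped, because their offset is relative to a symbol rather than to the library. The function names are illustrative, not part of any Gluster tool.

```python
import re
import shlex

# Module-relative frames look like:
#   /usr/lib/.../disperse.so(+0x37a14)[0x7f74ecfcea14]
# Frames such as libglusterfs.so.0(gf_print_trace+0x700)[...] are not matched:
# their offset is relative to a symbol, not to the start of the library.
FRAME_RE = re.compile(r"^(?:>+\s*)?(\S+?)\(\+(0x[0-9a-fA-F]+)\)\[0x[0-9a-fA-F]+\]$")


def module_offsets(trace: str) -> list[tuple[str, str]]:
    """Return (library, offset) pairs in the order they appear in the trace."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group(1), match.group(2)))
    return frames


def addr2line_commands(trace: str) -> list[str]:
    """Build one addr2line command per library, with duplicate offsets removed."""
    by_lib: dict[str, list[str]] = {}
    for lib, offset in module_offsets(trace):
        offsets = by_lib.setdefault(lib, [])
        if offset not in offsets:
            offsets.append(offset)
    # -f: print function names, -C: demangle, -i: also show inlined callers.
    return [
        " ".join(["addr2line", "-f", "-C", "-i", "-e", shlex.quote(lib), *offsets])
        for lib, offsets in by_lib.items()
    ]
```

For example, with the backtrace saved to a file (crash.txt is a hypothetical name), print("\n".join(addr2line_commands(open("crash.txt").read()))) lists the commands to run by hand on the node where the crash happened.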

Best,

*Angel Docampo*
<https://www.google.com/maps/place/Edificio+de+Oficinas+Euro+3/@41.3755943,2.0730134,17z/data=!3m2!4b1!5s0x12a4997021aad323:0x3e06bf8ae6d68351!4m5!3m4!1s0x12a4997a67bf592f:0x83c2323a9cc2aa4b!8m2!3d41.3755903!4d2.0752021>
  <angel.docampo at eoniantec.com>  <+34-93-1592929>

On Mon, 21 Nov 2022 at 15:16, Xavi Hernandez (<jahernan at redhat.com>)
wrote:

> Hi Angel,
>
> On Mon, Nov 21, 2022 at 2:33 PM Angel Docampo <angel.docampo at eoniantec.com>
> wrote:
>
>> Sorry for necrobumping this, but this morning I ran into this on my
>> Proxmox + GlusterFS cluster. In the log I can see this:
>>
>> [2022-11-21 07:38:00.213620 +0000] I [MSGID: 133017]
>> [shard.c:7275:shard_seek] 11-vmdata-shard: seek called on
>> fbc063cb-874e-475d-b585-f89f7518acdd. [Operation not supported]
>> pending frames:
>> frame : type(1) op(WRITE)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> frame : type(0) op(0)
>> ...
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> frame : type(1) op(FSYNC)
>> patchset: git://git.gluster.org/glusterfs.git
>> signal received: 11
>> time of crash:
>> 2022-11-21 07:38:00 +0000
>> configuration details:
>> argp 1
>> backtrace 1
>> dlfcn 1
>> libpthread 1
>> llistxattr 1
>> setfsid 1
>> epoll.h 1
>> xattr.h 1
>> st_atim.tv_nsec 1
>> package-string: glusterfs 10.3
>> /lib/x86_64-linux-gnu/libglusterfs.so.0(+0x28a54)[0x7f74f286ba54]
>> /lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x700)[0x7f74f2873fc0]
>> /lib/x86_64-linux-gnu/libc.so.6(+0x38d60)[0x7f74f262ed60]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/cluster/disperse.so(+0x37a14)[0x7f74ecfcea14]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/cluster/disperse.so(+0x19414)[0x7f74ecfb0414]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/cluster/disperse.so(+0x16373)[0x7f74ecfad373]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/cluster/disperse.so(+0x21d59)[0x7f74ecfb8d59]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/cluster/disperse.so(+0x22815)[0x7f74ecfb9815]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/cluster/disperse.so(+0x377d9)[0x7f74ecfce7d9]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/cluster/disperse.so(+0x19414)[0x7f74ecfb0414]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/cluster/disperse.so(+0x16373)[0x7f74ecfad373]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/cluster/disperse.so(+0x170f9)[0x7f74ecfae0f9]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/cluster/disperse.so(+0x313bb)[0x7f74ecfc83bb]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/protocol/client.so(+0x48e3a)[0x7f74ed06ce3a]
>> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xfccb)[0x7f74f2816ccb]
>> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x26)[0x7f74f2812646]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/rpc-transport/socket.so(+0x64c8)[0x7f74ee15f4c8]
>> /usr/lib/x86_64-linux-gnu/glusterfs/10.3/rpc-transport/socket.so(+0xd38c)[0x7f74ee16638c]
>> /lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7971d)[0x7f74f28bc71d]
>> /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7)[0x7f74f27d2ea7]
>> /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f74f26f2aef]
>> ---------
>> The mount point was inaccessible (accessing it failed with "Transport
>> endpoint is not connected") and it was listed like this:
>> d?????????   ? ?    ?            ?            ? vmdata
>>
>> I had to stop all the VMs on that Proxmox node, then stop the gluster
>> daemon to unmount the directory; after starting the daemon and
>> re-mounting, everything was working again.
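A dead client mount like this one shows up to programs as stat() failing with ENOTCONN ("Transport endpoint is not connected"), the same failure that makes ls print the row of question marks. A minimal sketch of a check that a cron job or systemd timer could run to notice quickly if the crash happens again; the function name and the returned labels are illustrative, and the stat parameter exists only so the check can be exercised without a broken mount.

```python
import errno
import os
from typing import Callable


def fuse_mount_state(path: str, stat: Callable[[str], os.stat_result] = os.stat) -> str:
    """Return 'ok', 'disconnected' (ENOTCONN) or 'error:<ERRNO name>'."""
    try:
        stat(path)
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            # "Transport endpoint is not connected": typically the FUSE client is gone.
            return "disconnected"
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"
```

Anything other than "ok" for the volume's mount point is worth an alert; on "disconnected", the recovery described above (stop the VMs, unmount, remount) is still needed.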
>>
>> This is the output of gluster volume info:
>>
>> Volume Name: vmdata
>> Type: Distributed-Disperse
>> Volume ID: cace5aa4-b13a-4750-8736-aa179c2485e1
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 2 x (2 + 1) = 6
>> Transport-type: tcp
>> Bricks:
>> Brick1: g01:/data/brick1/brick
>> Brick2: g02:/data/brick2/brick
>> Brick3: g03:/data/brick1/brick
>> Brick4: g01:/data/brick2/brick
>> Brick5: g02:/data/brick1/brick
>> Brick6: g03:/data/brick2/brick
>> Options Reconfigured:
>> nfs.disable: on
>> transport.address-family: inet
>> storage.fips-mode-rchecksum: on
>> features.shard: enable
>> features.shard-block-size: 256MB
>> performance.read-ahead: off
>> performance.quick-read: off
>> performance.io-cache: off
>> server.event-threads: 2
>> client.event-threads: 3
>> performance.client-io-threads: on
>> performance.stat-prefetch: off
>> dht.force-readdirp: off
>> performance.force-readdirp: off
>> network.remote-dio: on
>> features.cache-invalidation: on
>> performance.parallel-readdir: on
>> performance.readdir-ahead: on
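Options Reconfigured lists only the options that have been explicitly set on the volume, so one quick way to see whether open-behind has already been touched is to read that block into a dictionary. A minimal sketch, assuming the plain "key: value" layout shown above (quote markers are stripped); the function name is hypothetical. In this output performance.open-behind does not appear, so it is still at its default, which gluster volume get vmdata performance.open-behind would report.

```python
def reconfigured_options(volume_info: str) -> dict[str, str]:
    """Collect the 'key: value' lines listed under 'Options Reconfigured:'."""
    options: dict[str, str] = {}
    in_block = False
    for raw in volume_info.splitlines():
        # Drop e-mail quote markers ("> ", ">> ") before parsing.
        line = raw.lstrip("> ").strip()
        if line == "Options Reconfigured:":
            in_block = True
            continue
        if not in_block:
            continue
        if not line or ":" not in line:
            # A blank or non key/value line ends the block.
            break
        key, _, value = line.partition(":")
        options[key.strip()] = value.strip()
    return options
```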
>>
>> Xavi, do you think setting open-behind off could help somehow? I did
>> try to understand what it does (with no luck), and whether it could impact
>> the performance of my VMs (I have the setup you know so well ;)).
>> I would like to avoid more crashes like this; Gluster 10.3 had been
>> working quite well for two weeks, until this morning.
>>
>
> I don't think disabling open-behind will have any visible effect on
> performance. Open-behind is only useful for small files when the workload
> is mostly open + read + close, and quick-read is also enabled (which is not
> your case). The only effect it will have is that the latency "saved" during
> open is "paid" on the next operation sent to the file, so the total overall
> latency should be the same. Additionally, VM workloads don't open files
> frequently, so it shouldn't matter much in any case.
>
> That said, I'm not sure the problem is the same in your case. Based on
> the stack of the crash, it seems to be an issue inside the disperse module.
>
> What OS are you using? Are you using official packages? If so, which
> ones?
>
> Is it possible to provide a backtrace from gdb?
>
> Regards,
>
> Xavi
>
>
>>
>> On Fri, 19 Mar 2021 at 2:10, David Cunningham
>> (<dcunningham at voisonics.com>) wrote:
>>
>>> Hi Xavi,
>>>
>>> Thank you for that information. We'll look at upgrading it.
>>>
>>>
>>> On Fri, 12 Mar 2021 at 05:20, Xavi Hernandez <jahernan at redhat.com>
>>> wrote:
>>>
>>>> Hi David,
>>>>
>>>> With so little information it's hard to tell, but given that there are
>>>> several OPEN and UNLINK operations, it could be related to a bug in
>>>> open-behind that is already fixed in recent versions.
>>>>
>>>> You can try disabling open-behind with this command:
>>>>
>>>>     # gluster volume set <volname> open-behind off
>>>>
>>>> But given that the version you are using is very old and unmaintained,
>>>> I would recommend upgrading to at least 8.x.
>>>>
>>>> Regards,
>>>>
>>>> Xavi
>>>>
>>>>
>>>> On Wed, Mar 10, 2021 at 5:10 AM David Cunningham <
>>>> dcunningham at voisonics.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We have a GlusterFS 5.13 server which also mounts itself with the
>>>>> native FUSE client. Recently the FUSE mount crashed and we found the
>>>>> following in the syslog. There isn't anything logged in mnt-glusterfs.log
>>>>> for that time. After killing all processes with a file handle open on the
>>>>> filesystem we were able to unmount and then remount the filesystem
>>>>> successfully.
>>>>>
>>>>> Would anyone have advice on how to debug this crash? Thank you in
>>>>> advance!
>>>>>
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: pending frames:
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: frame : type(0) op(0)
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: frame : type(0) op(0)
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: frame : type(1) op(UNLINK)
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: frame : type(1) op(UNLINK)
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: frame : type(1) op(OPEN)
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: message repeated 3355 times: [ frame : type(1) op(OPEN)]
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: frame : type(1) op(OPEN)
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: message repeated 6965 times: [ frame : type(1) op(OPEN)]
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: frame : type(1) op(OPEN)
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: message repeated 4095 times: [ frame : type(1) op(OPEN)]
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: frame : type(0) op(0)
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: patchset: git://git.gluster.org/glusterfs.git
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: signal received: 11
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: time of crash:
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: 2021-03-09 03:12:31
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: configuration details:
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: argp 1
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: backtrace 1
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: dlfcn 1
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: libpthread 1
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: llistxattr 1
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: setfsid 1
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: spinlock 1
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: epoll.h 1
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: xattr.h 1
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: st_atim.tv_nsec 1
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: package-string: glusterfs 5.13
>>>>> Mar 9 05:12:31 voip1 mnt-glusterfs[2932]: ---------
>>>>> ...
>>>>> Mar 9 05:13:50 voip1 systemd[1]: glusterfssharedstorage.service: Main process exited, code=killed, status=11/SEGV
>>>>> Mar 9 05:13:50 voip1 systemd[1]: glusterfssharedstorage.service: Failed with result 'signal'.
>>>>> ...
>>>>> Mar 9 05:13:54 voip1 systemd[1]: glusterfssharedstorage.service: Service hold-off time over, scheduling restart.
>>>>> Mar 9 05:13:54 voip1 systemd[1]: glusterfssharedstorage.service: Scheduled restart job, restart counter is at 2.
>>>>> Mar 9 05:13:54 voip1 systemd[1]: Stopped Mount glusterfs sharedstorage.
>>>>> Mar 9 05:13:54 voip1 systemd[1]: Starting Mount glusterfs sharedstorage...
>>>>> Mar 9 05:13:54 voip1 mount-shared-storage.sh[20520]: ERROR: Mount point does not exist
>>>>> Mar 9 05:13:54 voip1 mount-shared-storage.sh[20520]: Please specify a mount point
>>>>> Mar 9 05:13:54 voip1 mount-shared-storage.sh[20520]: Usage:
>>>>> Mar 9 05:13:54 voip1 mount-shared-storage.sh[20520]: man 8 /sbin/mount.glusterfs
>>>>>
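The diagnosis in this thread leans on how many OPEN and UNLINK frames were pending at the time of the crash. Because syslog folds identical lines into "message repeated N times: [ ... ]", the totals are easier to read from a small tally. A minimal sketch, assuming each syslog record sits on a single line and that a repeat line stands for N further copies of the bracketed message (rsyslog's repeated-message reduction); the function name is illustrative.

```python
import re
from collections import Counter

# "frame : type(1) op(OPEN)" -> (1, "OPEN")
FRAME_RE = re.compile(r"frame : type\((\d+)\) op\((\w+)\)")
# rsyslog folds duplicates into "message repeated N times: [ <message>]".
REPEAT_RE = re.compile(r"message repeated (\d+) times:")


def tally_pending_frames(log: str) -> Counter:
    """Count pending frames per (type, op), expanding syslog repeat lines."""
    counts: Counter = Counter()
    for line in log.splitlines():
        frame = FRAME_RE.search(line)
        if not frame:
            continue
        key = (int(frame.group(1)), frame.group(2))
        repeat = REPEAT_RE.search(line)
        # A repeat line stands for N more copies of the bracketed frame.
        counts[key] += int(repeat.group(1)) if repeat else 1
    return counts
```

Under those assumptions, the 5.13 log above (one record per line) adds up to 14,418 OPEN frames, 2 UNLINK frames and 3 type(0) op(0) frames. For a report with elided lines, such as the "..." in the 10.3 crash, the counts are only lower bounds.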
>>>>> --
>>>>> David Cunningham, Voisonics Limited
>>>>> http://voisonics.com/
>>>>> USA: +1 213 221 1092
>>>>> New Zealand: +64 (0)28 2558 3782
>>>>> ________
>>>>>
>>>>>
>>>>>
>>>>> Community Meeting Calendar:
>>>>>
>>>>> Schedule -
>>>>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>>>>> Bridge: https://meet.google.com/cpu-eiue-hvk
>>>>> Gluster-users mailing list
>>>>> Gluster-users at gluster.org
>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>
>>>>
>>>
>>>
>>