[Gluster-users] Gluster fuse mount freezes entire server until killed and remounted

Artem Russakovskii archon810 at gmail.com
Tue May 12 17:14:15 UTC 2020


Hi all,

We have been observing a critical issue that started in the last several
months and has already randomly affected our servers 3 times.


*The symptoms:*

   - df stops responding and hangs
   - sometimes apache and nginx stop responding and all requests hang,
   sometimes they continue working, though nginx returns incomplete results

Upon investigating the issue when it happened again today, I narrowed it
down to glusterfs and specifically one of the fuse mount processes.

df freezes like this:
stat("/run/user/0", {st_mode=S_IFDIR|0700, st_size=100, …}) = 0
stat("/var/run/user/0", {st_mode=S_IFDIR|0700, st_size=100, …}) = 0
stat("/run/user/1000", {st_mode=S_IFDIR|0700, st_size=80, …}) = 0
stat("/var/run/user/1000", {st_mode=S_IFDIR|0700, st_size=80, …}) = 0
stat("/sys/kernel/debug/tracing", 0x7ffc32784ef0) = -1 EACCES (Permission denied)
stat("/mnt/androidpolicedata3", {st_mode=S_IFDIR|0755, st_size=4096, …}) = 0
stat("/mnt/apkmirror_data1", ^C^C^C^C^C

/mnt/apkmirror_data1 is a glusterfs fuse mount backed by this attached
block device:
/dev/disk/by-id/scsi-0LinodeVolumehiveblock1 /mnt/hive_block1 xfs defaults 0 2

It's pretty crazy that any access to this /mnt/apkmirror_data1 location
freezes any program issuing the stat call indefinitely.
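
For anyone who wants to check for this condition without freezing their
own shell or monitoring script, here is a minimal sketch in Python. It
runs the stat call in a child process and gives up after a timeout. The
function name probe_path and the 5-second default are illustrative
assumptions, not anything provided by gluster or fuse.

```python
import subprocess
import sys

# Child program: a single stat() on the path passed as argv[1].
_STAT_SNIPPET = "import os, sys; os.stat(sys.argv[1])"


def probe_path(path, timeout=5.0):
    """Return 'ok', 'error', or 'hung' for a stat() of `path`.

    The stat runs in a child process, so a frozen mount blocks the
    child rather than the caller. (Hypothetical helper, not a gluster API.)
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _STAT_SNIPPET, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort: a child stuck in uninterruptible sleep may not die
        # on SIGKILL, so never wait on it without a timeout.
        proc.kill()
        try:
            proc.wait(timeout=1.0)
        except subprocess.TimeoutExpired:
            pass
        return "hung"
    return "ok" if returncode == 0 else "error"
```

Calling probe_path("/mnt/apkmirror_data1") from a cron job or health check
would then report "hung" instead of blocking forever. Note that the stuck
child process may linger until the mount is recovered.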

During this time, the block device itself was reachable and I could list
files, so I have to assume the issue lies somewhere in glusterfs, fuse, or
the kernel.

After killing this process
root 9485 1 6 Apr30 ? 18:36:26 /usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id=/apkmirror_data1 /mnt/apkmirror_data1
and remounting the fuse mount, everything returned to normal.
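
The workaround above can be sketched as a short command plan. This is
only an illustration under assumptions: the choice of SIGKILL, the lazy
unmount (umount -l), and the helper name recovery_commands are mine, and
the final mount relies on the /etc/fstab entry for the mount point. The
sketch only builds and prints the commands; it does not run them.

```python
import shlex


def recovery_commands(pid, mount_point):
    """Build (but do not run) the kill-and-remount commands.

    Hypothetical helper; the signal and the umount flag are assumptions.
    """
    return [
        ["kill", "-KILL", str(pid)],    # stop the hung glusterfs client process
        ["umount", "-l", mount_point],  # lazily detach the stale fuse mount
        ["mount", mount_point],         # remount using the /etc/fstab entry
    ]


# PID and mount point taken from the ps output above.
for cmd in recovery_commands(9485, "/mnt/apkmirror_data1"):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the plan first makes it easy to review the PID and path before
running anything as root.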

One of my suspicions is that the issue started when we upgraded our
OpenSUSE 15.1 machines from kernel 5.1.17 to 5.4.10. Machines still on
5.1.17 have not experienced it; only machines running 5.4.10 have. It took
15 days after the last reboot to hit the issue today, so it's very
sporadic, but also very critical when it does hit.


*Questions:*

   1. How can we tell what specific fuse version is being used by gluster?
   2. Are there any gluster or fuse parameters that control the fuse
   timeout, so that perhaps it internally tries to remount if fuse hangs?
   Currently, it's mounted like this:
   localhost:/apkmirror_data1 /mnt/apkmirror_data1 glusterfs defaults,_netdev 0 0
   3. Does the team have any further thoughts or perhaps someone knows how
   to fix the issue or has seen a kernel or fuse/gluster advisory?

Thank you.

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror
<http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR
<http://twitter.com/ArtemR>