[Gluster-users] glusterfs health-check failed, (brick) going down

Olaf Buitelaar olaf.buitelaar at gmail.com
Thu Jul 8 15:36:47 UTC 2021


Hi Jiri,

Unfortunately I don't know a solution to fix it, other than what I already
mentioned, which doesn't seem to be applicable to your specific setup.
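
(For reference, what helped in my case was disabling brick multiplexing; if
you want to double-check that on your cluster, something like this should
show the setting and, if needed, turn it off:

  gluster volume get all cluster.brick-multiplex
  gluster volume set all cluster.brick-multiplex off
)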
I don't think it's oVirt related (I'm running oVirt myself as well, though
stuck at 4.3 atm, since CentOS 7 is not supported for 4.4).
If memory serves me well, I believe I started seeing this issue after
upgrading from glusterfs 3.12 to 4.x (I believe this went together with the
upgrade from oVirt 4.1 to 4.2), and I've seen it on every version since;
I'm currently running 7.9.
It would be nice to get to the bottom of this. I'm still not 100% sure it's
even a glusterfs issue; something might be wrong with XFS or somewhere else
in the IO stack. But I don't know what the next debugging steps could be.
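
One thing that might be worth trying (just a sketch, adjust the file and
device names to your setup) is to take glusterfs out of the picture:

  # mimic what the health-check thread does: an O_DIRECT read of the
  # check file on the brick
  dd if=/gluster_bricks/vms2/vms2/.glusterfs/health_check \
     of=/dev/null iflag=direct bs=4096 count=1

  # read-only XFS consistency check (-n = no modify); the brick
  # filesystem must be unmounted first
  umount /gluster_bricks/vms2
  xfs_repair -n /dev/<your_brick_device>

If that dd also fails with "Structure needs cleaning" while glusterfs is
not running, it would point at XFS or the IO stack rather than gluster.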
Just as a side note, I've also observed this issue on systems without LVM
cache.

Cheers Olaf


On Thu 8 Jul 2021 at 16:53, Jiří Sléžka <jiri.slezka at slu.cz> wrote:

> Hi Olaf,
>
> thanks for reply.
>
> On 7/8/21 3:29 PM, Olaf Buitelaar wrote:
> > Hi Jiri,
> >
> > your problem looks pretty similar to mine, see:
> > https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
> > Any chance you also see the XFS errors in the brick logs?
>
> yes, I can see these log lines related to the "health-check failed" items:
>
> [root at ovirt-hci02 ~]# grep "aio_read" /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408010] W [MSGID: 113075]
> [posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
> aio_read_cmp_buf() on /gluster_bricks/vms2/vms2/.glusterfs/health_check
> returned ret is -1 error is Structure needs cleaning
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.518844] W [MSGID: 113075]
> [posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
> aio_read_cmp_buf() on /gluster_bricks/vms2/vms2/.glusterfs/health_check
> returned ret is -1 error is Structure needs cleaning
>
> [root at ovirt-hci01 ~]# grep "aio_read" /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.982938] W [MSGID: 113075]
> [posix-helpers.c:2135:posix_fs_health_check] 0-engine-posix:
> aio_read_cmp_buf() on
> /gluster_bricks/engine/engine/.glusterfs/health_check returned ret is -1
> error is Structure needs cleaning
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.768534] W [MSGID: 113075]
> [posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
> aio_read_cmp_buf() on /gluster_bricks/vms2/vms2/.glusterfs/health_check
> returned ret is -1 error is Structure needs cleaning
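>
> (If I understand it correctly, "Structure needs cleaning" is the strerror
> text for EUCLEAN (errno 117), which XFS returns when it detects on-disk
> corruption; the mapping can be double-checked with e.g.:
>
>   python3 -c "import os; print(os.strerror(117))"
> )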
>
> it looks very similar to your issue but in my case I don't use LVM cache
> and brick disks are JBOD (but connected through Broadcom / LSI MegaRAID
> SAS-3 3008 [Fury] (rev 02)).
>
> > For me the situation improved once i disabled brick multiplexing, but i
> > don't see that in your volume configuration.
>
> Probably important is your note...
>
> > When I kill the brick process and start it with "gluster v start x force",
> > the issue seems much less likely to occur, but when started from a fresh
> > reboot, or when killing the process and letting it be started by glusterd
> > (e.g. service glusterd start), the error seems to arise after a couple of
> > minutes.
>
> ...because in the oVirt list Jayme replied with this:
>
>
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/BZRONK53OGWSOPUSGQ76GIXUM7J6HHMJ/
>
> and it looks to me like something you have also observed.
>
> Cheers, Jiri
>
> >
> > Cheers Olaf
> >
> > On Thu 8 Jul 2021 at 12:28, Jiří Sléžka <jiri.slezka at slu.cz
> > <mailto:jiri.slezka at slu.cz>> wrote:
> >
> >     Hello gluster community,
> >
> >     I am new to this list but have been using glusterfs for a long time as
> >     our SDS solution, storing 80+ TiB of data. I'm also using glusterfs for
> >     a small 3-node HCI cluster with oVirt 4.4.6 and CentOS 8 (not Stream
> >     yet). The glusterfs version here is 8.5-2.el8.x86_64.
> >
> >     From time to time a random brick on a random host goes down, I believe
> >     because of the health-check. It looks like
> >
> >     [root at ovirt-hci02 ~]# grep "posix_health_check"
> >     /var/log/glusterfs/bricks/*
> >     /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> >     07:13:37.408184] M [MSGID: 113075]
> >     [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> >     health-check failed, going down
> >     /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> >     07:13:37.408407] M [MSGID: 113075]
> >     [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> >     still
> >     alive! -> SIGTERM
> >     /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> >     16:11:14.518971] M [MSGID: 113075]
> >     [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> >     health-check failed, going down
> >     /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> >     16:11:14.519200] M [MSGID: 113075]
> >     [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> >     still
> >     alive! -> SIGTERM
> >
> >     on other host
> >
> >     [root at ovirt-hci01 ~]# grep "posix_health_check"
> >     /var/log/glusterfs/bricks/*
> >
>  /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> >     13:15:51.983327] M [MSGID: 113075]
> >     [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix:
> >     health-check failed, going down
> >
>  /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> >     13:15:51.983728] M [MSGID: 113075]
> >     [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix:
> >     still alive! -> SIGTERM
> >     /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> >     01:53:35.769129] M [MSGID: 113075]
> >     [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> >     health-check failed, going down
> >     /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> >     01:53:35.769819] M [MSGID: 113075]
> >     [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> >     still
> >     alive! -> SIGTERM
> >
> >     I cannot link these errors to any storage/fs issue (in dmesg or
> >     /var/log/messages), and the brick devices look healthy (smartd).
> >
> >     I can force start brick with
> >
> >     gluster volume start vms|engine force
> >
> >     and after some healing everything works fine for a few days.
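> >
> >     (As a possible stopgap I was also looking at the health-check interval
> >     itself; if I read the docs right, something like this should show and
> >     change it, with 0 disabling the check entirely, though that would only
> >     hide the symptom:
> >
> >     gluster volume get vms storage.health-check-interval
> >     gluster volume set vms storage.health-check-interval 0
> >     )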
> >
> >     Has anybody else observed this behavior?
> >
> >     The vms volume has this structure (two bricks per host, each a separate
> >     JBOD SSD disk); the engine volume has one brick on each host...
> >
> >     gluster volume info vms
> >
> >     Volume Name: vms
> >     Type: Distributed-Replicate
> >     Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
> >     Status: Started
> >     Snapshot Count: 0
> >     Number of Bricks: 2 x 3 = 6
> >     Transport-type: tcp
> >     Bricks:
> >     Brick1: 10.0.4.11:/gluster_bricks/vms/vms
> >     Brick2: 10.0.4.13:/gluster_bricks/vms/vms
> >     Brick3: 10.0.4.12:/gluster_bricks/vms/vms
> >     Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
> >     Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
> >     Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
> >     Options Reconfigured:
> >     cluster.granular-entry-heal: enable
> >     performance.stat-prefetch: off
> >     cluster.eager-lock: enable
> >     performance.io-cache: off
> >     performance.read-ahead: off
> >     performance.quick-read: off
> >     user.cifs: off
> >     network.ping-timeout: 30
> >     network.remote-dio: off
> >     performance.strict-o-direct: on
> >     performance.low-prio-threads: 32
> >     features.shard: on
> >     storage.owner-gid: 36
> >     storage.owner-uid: 36
> >     transport.address-family: inet
> >     storage.fips-mode-rchecksum: on
> >     nfs.disable: on
> >     performance.client-io-threads: off
> >
> >
> >     Cheers,
> >
> >     Jiri
> >
>
>
> ________
>
>
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>