[Gluster-users] glusterfs health-check failed, (brick) going down

Olaf Buitelaar olaf.buitelaar at gmail.com
Thu Jul 8 13:29:06 UTC 2021


Hi Jiri,

Your problem looks pretty similar to mine, see:
https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
Any chance you also see the XFS errors in the brick logs?
For me the situation improved once I disabled brick multiplexing, but I
don't see that enabled in your volume configuration.
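
In case you want to double-check: brick multiplexing is a cluster-wide
option, so (if I remember correctly) something like this shows the current
setting and turns it off if it happens to be on:

  gluster volume get all cluster.brick-multiplex
  gluster volume set all cluster.brick-multiplex off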

Cheers, Olaf

On Thu, 8 Jul 2021 at 12:28, Jiří Sléžka <jiri.slezka at slu.cz> wrote:

> Hello gluster community,
>
> I am new to this list, but I have been using glusterfs for a long time as
> our SDS solution for storing 80+ TiB of data. I'm also using glusterfs for
> a small 3-node HCI cluster with oVirt 4.4.6 and CentOS 8 (not Stream yet).
> The glusterfs version here is 8.5-2.el8.x86_64.
>
> From time to time (I believe) a random brick on a random host goes down
> because of a failed health-check. It looks like this:
>
> [root at ovirt-hci02 ~]# grep "posix_health_check"
> /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408184] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408407] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still
> alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.518971] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.519200] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still
> alive! -> SIGTERM
>
> On another host:
>
> [root at ovirt-hci01 ~]# grep "posix_health_check"
> /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983327] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983728] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix:
> still alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769129] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769819] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still
> alive! -> SIGTERM
>
> I cannot link these errors to any storage/fs issue (nothing shows up in
> dmesg or /var/log/messages), and the brick devices look healthy (smartd).
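>
> Roughly, the checks were along these lines (/dev/sdb is just an example
> device):
>
>    dmesg -T | grep -iE "xfs|i/o error"
>    grep -iE "xfs|i/o error" /var/log/messages
>    smartctl -H /dev/sdb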
>
> I can force-start the brick with
>
> gluster volume start vms|engine force
>
> and after some healing everything works fine for a few days.
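>
> For reference, after the force start I keep an eye on the healing with
> roughly these commands, until the pending heal counts drop to zero:
>
>    gluster volume status vms
>    gluster volume heal vms info summary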
>
> Has anybody else observed this behavior?
>
> The vms volume has this structure (two bricks per host, each a separate
> JBOD SSD disk); the engine volume has one brick on each host...
>
> gluster volume info vms
>
> Volume Name: vms
> Type: Distributed-Replicate
> Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 3 = 6
> Transport-type: tcp
> Bricks:
> Brick1: 10.0.4.11:/gluster_bricks/vms/vms
> Brick2: 10.0.4.13:/gluster_bricks/vms/vms
> Brick3: 10.0.4.12:/gluster_bricks/vms/vms
> Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
> Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
> Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
> Options Reconfigured:
> cluster.granular-entry-heal: enable
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> user.cifs: off
> network.ping-timeout: 30
> network.remote-dio: off
> performance.strict-o-direct: on
> performance.low-prio-threads: 32
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> nfs.disable: on
> performance.client-io-threads: off
>
>
> Cheers,
>
> Jiri
>