[Gluster-users] glusterfs health-check failed, (brick) going down
Jiří Sléžka
jiri.slezka at slu.cz
Thu Jul 8 14:52:54 UTC 2021
Hi Olaf,
thanks for the reply.
On 7/8/21 3:29 PM, Olaf Buitelaar wrote:
> Hi Jiri,
>
> your problem looks pretty similar to mine; see:
> https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
> Any chance you also see the XFS errors in the brick logs?
yes, I can see these log lines related to the "health-check failed" items:
[root@ovirt-hci02 ~]# grep "aio_read" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
07:13:37.408010] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
aio_read_cmp_buf() on /gluster_bricks/vms2/vms2/.glusterfs/health_check
returned ret is -1 error is Structure needs cleaning
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
16:11:14.518844] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
aio_read_cmp_buf() on /gluster_bricks/vms2/vms2/.glusterfs/health_check
returned ret is -1 error is Structure needs cleaning
[root@ovirt-hci01 ~]# grep "aio_read" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
13:15:51.982938] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-engine-posix:
aio_read_cmp_buf() on
/gluster_bricks/engine/engine/.glusterfs/health_check returned ret is -1
error is Structure needs cleaning
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
01:53:35.768534] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
aio_read_cmp_buf() on /gluster_bricks/vms2/vms2/.glusterfs/health_check
returned ret is -1 error is Structure needs cleaning
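All of them point at the .glusterfs/health_check file on the brick, which
(as far as I understand posix-helpers.c) the posix health-check thread
writes and reads back every storage.health-check-interval seconds. The
interval can be inspected or changed per volume, e.g.:

gluster volume get vms storage.health-check-interval
gluster volume set vms storage.health-check-interval 30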
it looks very similar to your issue, but in my case I don't use LVM cache
and the brick disks are JBOD (though connected through a Broadcom / LSI
MegaRAID SAS-3 3008 [Fury] (rev 02) controller).
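"Structure needs cleaning" is the XFS EUCLEAN error, so I will probably
take the affected brick offline and check the filesystem itself. A rough
sketch (the device path is only an example for my layout):

# with the brick process stopped, unmount and run a read-only check
umount /gluster_bricks/vms2
xfs_repair -n /dev/sdX      # -n = no modify, only report problems
mount /gluster_bricks/vms2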
> For me the situation improved once i disabled brick multiplexing, but i
> don't see that in your volume configuration.
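If I understand it correctly, brick multiplexing is a cluster-wide option,
so it can be checked (and, if needed, disabled) like this:

gluster volume get all cluster.brick-multiplex
gluster volume set all cluster.brick-multiplex off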
What's probably important is your note...
> When I kill the brick process and start it with "gluster v start x force",
> the issue seems much less likely to occur, but when it is started from a
> fresh reboot, or when I kill the process and let it be started by glusterd
> (e.g. service glusterd start), the error seems to arise after a couple of
> minutes.
...because in the oVirt list Jayme replied with this:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/BZRONK53OGWSOPUSGQ76GIXUM7J6HHMJ/
and it looks to me like something you also observe (the two start paths are
sketched below).
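For reference, the two start paths look roughly like this on my nodes
(the volume name is just an example, the PID comes from volume status):

# path 1: kill only the brick process and force-start it via the CLI
gluster volume status vms        # note the PID of the failing brick
kill <brick-pid>
gluster volume start vms force

# path 2: let glusterd respawn the brick, e.g. after a daemon restart
systemctl restart glusterd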
Cheers, Jiri
>
> Cheers Olaf
>
> On Thu, 8 Jul 2021 at 12:28, Jiří Sléžka <jiri.slezka at slu.cz> wrote:
>
> Hello gluster community,
>
> I am new to this list but have been using glusterfs for a long time as our
> SDS solution for storing 80+ TiB of data. I'm also using glusterfs for a
> small 3-node HCI cluster with oVirt 4.4.6 and CentOS 8 (not Stream yet).
> The glusterfs version here is 8.5-2.el8.x86_64.
>
> From time to time a (I believe) random brick on a random host goes down
> because of the health-check. It looks like this:
>
> [root@ovirt-hci02 ~]# grep "posix_health_check"
> /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408184] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408407] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> still
> alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.518971] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.519200] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> still
> alive! -> SIGTERM
>
> on other host
>
> [root@ovirt-hci01 ~]# grep "posix_health_check"
> /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983327] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983728] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix:
> still alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769129] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769819] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> still
> alive! -> SIGTERM
>
> I cannot link these errors to any storage/fs issue (in dmesg or
> /var/log/messages), and the brick devices look healthy (smartd).
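> The kind of checks I mean, roughly (the device name is just an example):
>
> dmesg -T | grep -iE 'xfs|i/o error'
> grep -iE 'xfs|i/o error' /var/log/messages
> smartctl -H /dev/sda        # overall SMART health self-assessment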
>
> I can force start the brick with
>
> gluster volume start vms|engine force
>
> and after some healing everything works fine for a few days.
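> To watch the healing after the force start I use roughly this (the volume
> name is an example):
>
> gluster volume status vms             # the brick should be back online
> gluster volume heal vms info summary  # pending heal counts per brick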
>
> Did anybody observe this behavior?
>
> The vms volume has this structure (two bricks per host, each a separate
> JBOD SSD disk); the engine volume has one brick on each host...
>
> gluster volume info vms
>
> Volume Name: vms
> Type: Distributed-Replicate
> Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 3 = 6
> Transport-type: tcp
> Bricks:
> Brick1: 10.0.4.11:/gluster_bricks/vms/vms
> Brick2: 10.0.4.13:/gluster_bricks/vms/vms
> Brick3: 10.0.4.12:/gluster_bricks/vms/vms
> Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
> Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
> Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
> Options Reconfigured:
> cluster.granular-entry-heal: enable
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> user.cifs: off
> network.ping-timeout: 30
> network.remote-dio: off
> performance.strict-o-direct: on
> performance.low-prio-threads: 32
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> nfs.disable: on
> performance.client-io-threads: off
>
>
> Cheers,
>
> Jiri
>