[Gluster-users] glusterfs health-check failed, (brick) going down
Jiří Sléžka
jiri.slezka at slu.cz
Thu Jul 8 14:52:54 UTC 2021
Hi Olaf,
thanks for the reply.
On 7/8/21 3:29 PM, Olaf Buitelaar wrote:
> Hi Jiri,
>
> your problem looks pretty similar to mine; see:
> https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
> Any chance you also see the XFS errors in the brick logs?
yes, I can see these log lines related to the "health-check failed" items:
[root@ovirt-hci02 ~]# grep "aio_read" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
07:13:37.408010] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
aio_read_cmp_buf() on /gluster_bricks/vms2/vms2/.glusterfs/health_check
returned ret is -1 error is Structure needs cleaning
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
16:11:14.518844] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
aio_read_cmp_buf() on /gluster_bricks/vms2/vms2/.glusterfs/health_check
returned ret is -1 error is Structure needs cleaning
[root@ovirt-hci01 ~]# grep "aio_read" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
13:15:51.982938] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-engine-posix:
aio_read_cmp_buf() on
/gluster_bricks/engine/engine/.glusterfs/health_check returned ret is -1
error is Structure needs cleaning
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
01:53:35.768534] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
aio_read_cmp_buf() on /gluster_bricks/vms2/vms2/.glusterfs/health_check
returned ret is -1 error is Structure needs cleaning
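All of them point at the .glusterfs/health_check file on the brick, which
(as far as I understand posix-helpers.c) the posix health-check thread
writes and reads back every storage.health-check-interval seconds. The
interval can be inspected or changed per volume, e.g.:

gluster volume get vms storage.health-check-interval
gluster volume set vms storage.health-check-interval 30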
it looks very similar to your issue, but in my case I don't use LVM cache
and the brick disks are JBOD (though connected through a Broadcom / LSI
MegaRAID SAS-3 3008 [Fury] (rev 02) controller).
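"Structure needs cleaning" is the XFS EUCLEAN error, so I will probably
take the affected brick offline and check the filesystem itself. A rough
sketch (the device path is only an example for my layout):

# with the brick process stopped, unmount and run a read-only check
umount /gluster_bricks/vms2
xfs_repair -n /dev/sdX      # -n = no modify, only report problems
mount /gluster_bricks/vms2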
> For me the situation improved once i disabled brick multiplexing, but i
> don't see that in your volume configuration.
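If I understand it correctly, brick multiplexing is a cluster-wide option,
so it can be checked (and, if needed, disabled) like this:

gluster volume get all cluster.brick-multiplex
gluster volume set all cluster.brick-multiplex off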
What's probably important is your note...
> When I kill the brick process and start it with "gluster v start x force",
> the issue seems much less likely to occur, but when it is started from a
> fresh reboot, or when I kill the process and let it be started by glusterd
> (e.g. service glusterd start), the error seems to arise after a couple of
> minutes.
...because in the oVirt list Jayme replied with this:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/BZRONK53OGWSOPUSGQ76GIXUM7J6HHMJ/
and it looks to me like something you also observe (the two start paths are
sketched below).
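For reference, the two start paths look roughly like this on my nodes
(the volume name is just an example, the PID comes from volume status):

# path 1: kill only the brick process and force-start it via the CLI
gluster volume status vms        # note the PID of the failing brick
kill <brick-pid>
gluster volume start vms force

# path 2: let glusterd respawn the brick, e.g. after a daemon restart
systemctl restart glusterd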
Cheers, Jiri
>
> Cheers Olaf
>
> On Thu, 8 Jul 2021 at 12:28, Jiří Sléžka <jiri.slezka at slu.cz> wrote:
>
> Hello gluster community,
>
> I am new to this list but have been using glusterfs for a long time as our
> SDS solution for storing 80+ TiB of data. I'm also using glusterfs for a
> small 3-node HCI cluster with oVirt 4.4.6 and CentOS 8 (not Stream yet).
> The glusterfs version here is 8.5-2.el8.x86_64.
>
> From time to time a (I believe) random brick on a random host goes down
> because of the health-check. It looks like this:
>
> [root@ovirt-hci02 ~]# grep "posix_health_check"
> /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408184] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408407] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> still
> alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.518971] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.519200] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> still
> alive! -> SIGTERM
>
> on other host
>
> [root@ovirt-hci01 ~]# grep "posix_health_check"
> /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983327] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983728] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix:
> still alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769129] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769819] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> still
> alive! -> SIGTERM
>
> I cannot link these errors to any storage/fs issue (in dmesg or
> /var/log/messages), and the brick devices look healthy (smartd).
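> The kind of checks I mean, roughly (the device name is just an example):
>
> dmesg -T | grep -iE 'xfs|i/o error'
> grep -iE 'xfs|i/o error' /var/log/messages
> smartctl -H /dev/sda        # overall SMART health self-assessment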
>
> I can force start the brick with
>
> gluster volume start vms|engine force
>
> and after some healing everything works fine for a few days.
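> To watch the healing after the force start I use roughly this (the volume
> name is an example):
>
> gluster volume status vms             # the brick should be back online
> gluster volume heal vms info summary  # pending heal counts per brick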
>
> Did anybody observe this behavior?
>
> The vms volume has this structure (two bricks per host, each a separate
> JBOD SSD disk); the engine volume has one brick on each host...
>
> gluster volume info vms
>
> Volume Name: vms
> Type: Distributed-Replicate
> Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 3 = 6
> Transport-type: tcp
> Bricks:
> Brick1: 10.0.4.11:/gluster_bricks/vms/vms
> Brick2: 10.0.4.13:/gluster_bricks/vms/vms
> Brick3: 10.0.4.12:/gluster_bricks/vms/vms
> Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
> Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
> Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
> Options Reconfigured:
> cluster.granular-entry-heal: enable
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> user.cifs: off
> network.ping-timeout: 30
> network.remote-dio: off
> performance.strict-o-direct: on
> performance.low-prio-threads: 32
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> nfs.disable: on
> performance.client-io-threads: off
>
>
> Cheers,
>
> Jiri
>