<div dir="ltr">Dear Users,<div><br></div><div>Somehow the brick processes seem to crash on XFS filesystem errors. It seems to depend on how the gluster process is started. When this happens, gluster also sends a message to the console saying the process will go down, but it doesn&#39;t really seem to go down:</div><div><br></div><div>M [MSGID: 113075] [posix-helpers.c:2185:posix_health_check_thread_proc] 0-ovirt-engine-posix: health-check failed, going down<br></div><div> M [MSGID: 113075] [posix-helpers.c:2203:posix_health_check_thread_proc] 0-ovirt-engine-posix: still alive! -&gt; SIGTERM<br></div><div><br></div><div>In the brick log, a message like this is logged:</div><div>[posix-helpers.c:2111:posix_fs_health_check] 0-ovirt-data-posix: aio_read_cmp_buf() on /data5/gfs/bricks/brick1/ovirt-data/.glusterfs/health_check returned ret is -1 error is Structure needs cleaning<br></div><div><br></div><div>or like this:</div><div> W [MSGID: 113075] [posix-helpers.c:2111:posix_fs_health_check] 0-ovirt-mon-2-posix: aio_read_buf() on /data0/gfs/bricks/bricka/ovirt-mon-2/.glusterfs/health_check returned ret is -1 error is Success<br></div><div><br></div><div>When I check the actual file, it just seems to contain a timestamp:</div><div>cat /data0/gfs/bricks/bricka/ovirt-mon-2/.glusterfs/health_check<br>2021-01-28 09:08:01⏎<br></div><div><br></div><div>I also don&#39;t see any errors in dmesg about problems accessing it.</div><div><br></div><div>When I unmount the filesystem and run xfs_repair on it, no errors or issues are reported. 
Also, when I mount the filesystem again, it is reported as a clean mount:</div><div>[2478552.169540] XFS (dm-23): Mounting V5 Filesystem<br>[2478552.180645] XFS (dm-23): Ending clean mount<br></div><div><br></div><div>When I kill the brick process and start it with &quot;gluster v start x force&quot;, the issue seems much less likely to occur. But when the brick is started after a fresh reboot, or when I kill the process and let glusterd start it (e.g. service glusterd start), the error appears after a couple of minutes.</div><div><br></div><div>I am using LVM cache (in writethrough mode); maybe that&#39;s related. The disks themselves are behind a hardware RAID controller, and I did inspect all disks for SMART errors.</div><div><br></div><div>Does anybody have experience with this, or a clue about what might be causing it?</div><div><br></div><div>Thanks, Olaf</div><div><br></div><div><br></div><div><br></div><div><br></div></div>