[Gluster-users] How to fix I/O error ? (resend)

Diego Zuccato diego.zuccato at unibo.it
Fri Aug 21 11:56:17 UTC 2020

Hello all.

I have a volume set up as follows:
root@str957-biostor:~# gluster v info BigVol

Volume Name: BigVol
Type: Distributed-Replicate
Volume ID: c51926bd-6715-46b2-8bb3-8c915ec47e28
Status: Started
Snapshot Count: 0
Number of Bricks: 28 x (2 + 1) = 84
Transport-type: tcp
Brick1: str957-biostor2:/srv/bricks/00/BigVol
Brick2: str957-biostor:/srv/bricks/00/BigVol
Brick3: str957-biostq:/srv/arbiters/00/BigVol (arbiter)
Options Reconfigured:
cluster.granular-entry-heal: enable
client.event-threads: 8
server.event-threads: 8
server.ssl: on
client.ssl: on
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
features.bitrot: on
features.scrub: Active
features.scrub-freq: biweekly
auth.ssl-allow: str957-bio*
ssl.certificate-depth: 1
cluster.self-heal-daemon: enable
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
server.manage-gids: on
features.scrub-throttle: aggressive

After a couple of failures (a disk on biostor2 went "missing", and
glusterd on biostq was killed by the OOM killer), I noticed that some
files can't be accessed from the clients:
$ ls -lh 1_germline_CGTACTAG_L005_R*
-rwxr-xr-x 1 e.f domain^users 2,0G apr 24  2015
-rwxr-xr-x 1 e.f domain^users 2,0G apr 24  2015
$ ls -lh 1_germline_CGTACTAG_L005_R1_001.fastq.gz
ls: cannot access '1_germline_CGTACTAG_L005_R1_001.fastq.gz':
Input/output error
(Note that ls works when the pattern matches more than one file; it
fails when I ask for this file alone.)

The copies of the file on the two nodes have exactly the same contents
(verified via md5sum). The only difference in the getfattr output is
trusted.bit-rot.version, which is
0x17000000000000005f3f9e670002ad5b on one node and
0x12000000000000005f3ce7af000dccad on the other.
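
To make the two trusted.bit-rot.version values easier to compare, they
can be split into fields. The following is only a sketch: the layout it
assumes (an 8-byte little-endian version counter, followed by big-endian
seconds and microseconds) is inferred from the two values themselves and
is not confirmed anywhere in this post. The function name
decode_bitrot_version is made up for illustration.

```bash
#!/usr/bin/env bash
# Decode a trusted.bit-rot.version value as printed by `getfattr -e hex`.
# ASSUMED layout (inferred from the values, not confirmed by the post):
#   bytes 0-7    version counter, little-endian
#   bytes 8-11   seconds since the epoch, big-endian
#   bytes 12-15  microseconds, big-endian
decode_bitrot_version() {
    local hex=${1#0x}
    if [ "${#hex}" -ne 32 ]; then
        echo "expected 16 bytes (32 hex digits)" >&2
        return 1
    fi
    # Reverse the first 8 bytes to read the little-endian counter.
    local le=${hex:0:16} be="" i
    for ((i = 14; i >= 0; i -= 2)); do
        be+=${le:$i:2}
    done
    local version=$((16#$be))
    local secs=$((16#${hex:16:8}))
    local usecs=$((16#${hex:24:8}))
    echo "version=$version secs=$secs usecs=$usecs"
}

decode_bitrot_version 0x17000000000000005f3f9e670002ad5b
decode_bitrot_version 0x12000000000000005f3ce7af000dccad
```

Under that assumed layout the two copies carry counters 23 and 18, with
timestamps that GNU date (date -u -d @SECS) places on 21 and 19 August
2020. That would be consistent with the copies having been modified or
re-signed at different times rather than with differing data.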

On the client, the log reports:
[2020-08-21 11:32:52.208809] W [MSGID: 108008]
4-BigVol-replicate-13: GFID mismatch for
d70a4a6d-05fc-4988-8041-5e7f62155fe5 on BigVol-client-55 and
f249f88a-909f-489d-8d1d-d428e842ee96 on BigVol-client-34
[2020-08-21 11:32:52.209768] W [fuse-bridge.c:471:fuse_entry_cbk]
0-glusterfs-fuse: 233606: LOOKUP()
/[...]/1_germline_CGTACTAG_L005_R1_001.fastq.gz => -1 (Errore di
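
For reference, the two GFIDs in the warning above can be pulled out of
the log message and turned into the backend paths where each brick keeps
its GFID link (.glusterfs/<first two hex digits>/<next two>/<gfid>). This
is a minimal sketch: the helper names are made up, and /path/to/brick is
a placeholder, since the post does not say which bricks back
BigVol-client-55 and BigVol-client-34.

```bash
#!/usr/bin/env bash
# Print the path of the .glusterfs entry for a GFID on a given brick.
gfid_backend_path() {
    local brick=${1%/} gfid=$2
    echo "$brick/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid"
}

# Read log text on stdin and print every GFID-shaped token, one per line.
extract_gfids() {
    grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}'
}

# The message text is taken from the client log quoted above.
msg="GFID mismatch for d70a4a6d-05fc-4988-8041-5e7f62155fe5 on BigVol-client-55 and f249f88a-909f-489d-8d1d-d428e842ee96 on BigVol-client-34"

# /path/to/brick is a placeholder; substitute the real brick root.
printf '%s\n' "$msg" | extract_gfids | while read -r gfid; do
    gfid_backend_path /path/to/brick "$gfid"
done
```

Running ls -l or getfattr on the printed paths, on the bricks concerned,
shows which backend file each GFID actually refers to.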

As suggested on IRC, I tested the RAM, but the only result was a
"Peer rejected" status caused by another OOM kill. I was able to resolve
that, but the original problem remains.

What else can I do?


