[Gluster-users] Cluster not healing

Mon Jan 23 19:28:36 UTC 2017

Hello,

I have a couple of gluster clusters, set up with distributed/replicated
volumes, that have started incrementing the heal count reported by the heal
statistics. Some files also return an input/output error when accessed
from a fuse mount.

If I take one volume from one cluster as an example:

gluster volume heal storage01 statistics info
<snip>
Brick storage02.<redacted>:/storage/sdc/brick_storage01
Number of entries: 595
</snip>
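
With several volumes across two clusters, it can help to total these counts from saved output rather than read them by eye. This is a minimal sketch that assumes the output keeps the "Brick ..." / "Number of entries: N" layout shown in the snippet above. The function name entries_per_brick is hypothetical and not part of gluster.

```python
import re


def entries_per_brick(output):
    """Map each 'Brick <name>' line to the next 'Number of entries: N' value."""
    counts = {}
    brick = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
        elif brick is not None:
            match = re.match(r"Number of entries:\s*(\d+)$", line)
            if match:
                counts[brick] = int(match.group(1))
                brick = None  # wait for the next Brick line
    return counts


# Sample taken from the snippet above (hostname is redacted in the original).
sample = """\
Brick storage02.<redacted>:/storage/sdc/brick_storage01
Number of entries: 595
"""

counts = entries_per_brick(sample)
for name, count in counts.items():
    print(f"{count:6d}  {name}")
print("total:", sum(counts.values()))
```

Lines whose count is not a plain number are skipped rather than guessed at.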

I then looked at one of these files (I found 2 copies, one on each
server / brick).

First brick:

# getfattr -m . -d -e hex /storage/sdc/brick_storage01/projects/183-57c559ea4d60e-canary-test--node02/wordpress285-data/html/wp-content/themes/twentyfourteen/single.php
getfattr: Removing leading '/' from absolute path names
# file: storage/sdc/brick_storage01/projects/183-57c559ea4d60e-canary-test--node02/wordpress285-data/html/wp-content/themes/twentyfourteen/single.php
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.storage01-client-0=0x000000020000000100000000
trusted.bit-rot.version=0x02000000000000005874e2cd0000459d
trusted.gfid=0xda4253be1c2647b7b6ec5c045d61d216
trusted.glusterfs.quota.c9764826-596a-4886-9bc0-60ee9b3fce44.contri.1=0x00000000000006000000000000000001
trusted.pgfid.c9764826-596a-4886-9bc0-60ee9b3fce44=0x00000001

Second Brick:

# getfattr -m . -d -e hex /storage/sdc/brick_storage01/projects/183-57c559ea4d60e-canary-test--node02/wordpress285-data/html/wp-content/themes/twentyfourteen/single.php
getfattr: Removing leading '/' from absolute path names
# file: storage/sdc/brick_storage01/projects/183-57c559ea4d60e-canary-test--node02/wordpress285-data/html/wp-content/themes/twentyfourteen/single.php
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x020000000000000057868423000d6332
trusted.gfid=0x14f74b04679345289dbd3290a3665cbc
trusted.glusterfs.quota.47e007ee-6f91-4187-81f8-90a393deba2b.contri.1=0x00000000000006000000000000000001
trusted.pgfid.47e007ee-6f91-4187-81f8-90a393deba2b=0x00000001
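
The two dumps above can also be compared mechanically instead of by eye. The sketch below parses getfattr-style "name=value" lines and lists every attribute that is missing on one side or has a different value. The helper names parse_xattrs and diff_xattrs are hypothetical, and only three of the attributes from each dump are pasted in to keep it short.

```python
def parse_xattrs(dump):
    """Parse 'name=value' lines from a getfattr dump, skipping everything else."""
    attrs = {}
    for line in dump.splitlines():
        line = line.strip()
        # Skip blanks, '# file:' headers and getfattr notices without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        attrs[key] = value
    return attrs


def diff_xattrs(a, b):
    """Return {name: (value_in_a, value_in_b)} for attributes that differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Subset of the attributes shown above for each brick.
first_brick = """\
getfattr: Removing leading '/' from absolute path names
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.storage01-client-0=0x000000020000000100000000
trusted.gfid=0xda4253be1c2647b7b6ec5c045d61d216
"""
second_brick = """\
getfattr: Removing leading '/' from absolute path names
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x14f74b04679345289dbd3290a3665cbc
"""

diff = diff_xattrs(parse_xattrs(first_brick), parse_xattrs(second_brick))
for name, (first, second) in diff.items():
    print(f"{name}\n  first:  {first}\n  second: {second}")
```

With the values pasted above it reports trusted.afr.storage01-client-0 as present only on the first brick, and it also flags trusted.gfid, which differs between the two dumps.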

I can see that only the first brick has the appropriate trusted.afr.<client>
attribute, e.g. in this case:

trusted.afr.storage01-client-0=0x000000020000000100000000
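
For reference, this value can be split into its counters. The sketch assumes the layout described in the split-brain document linked below: 12 bytes holding three big-endian 32-bit pending counts for data, metadata and entry operations, in that order. The function name decode_afr is hypothetical.

```python
import struct


def decode_afr(hex_value):
    """Split a trusted.afr.* value into data/metadata/entry pending counts."""
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    # Assumed layout: three unsigned 32-bit big-endian integers.
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


pending = decode_afr("0x000000020000000100000000")
print(pending)  # {'data': 2, 'metadata': 1, 'entry': 0}
```

Under that assumption, this copy records 2 pending data operations and 1 pending metadata operation against storage01-client-0, and no pending entry operations.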

The files are the same size under stat; only the access/modify/change
times differ.

My first question: reading
https://gluster.readthedocs.io/en/latest/Troubleshooting/split-brain/
suggests that I should have this attribute on both copies of the file, or
am I misreading?

Secondly, am I correct that each one of these entries will require manual
fixing? (I have approx 6K files/directories in this state over two
clusters, which seems like an awful lot of manual fixing.)

I've checked gluster volume info <vol>, and all the appropriate services and
the self-heal daemon are running. We've even tried a heal and a full heal,
as well as iterating over parts of the filesystem in question with
find / stat / md5sum.

Any input appreciated.

Cheers,