[Gluster-users] self heal errors on 3.1.1 clients
Jeff Darcy
jdarcy at redhat.com
Thu Jan 27 14:01:11 UTC 2011
On 01/26/2011 07:25 PM, David Lloyd wrote:
> Well, I did this and it seems to have worked. I was just guessing really,
> didn't have any documentation or advice from anyone in the know.
>
> I just reset the attributes on the root directory for each brick that was
> not all zeroes.
>
> I found it easier to dump the attributes without the '-e hex' option:
>
> g4:~ # getfattr -d -m trusted.afr /mnt/glus1 /mnt/glus2
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/glus1
> trusted.afr.glustervol1-client-2=0sAAAAAAAAAAEAAAAA
> trusted.afr.glustervol1-client-3=0sAAAAAAAAAAAAAAAA
>
> Then
> setfattr -n trusted.afr.glustervol1-client-2 -v 0sAAAAAAAAAAAAAAAA /mnt/glus1
>
> I did that on all the bricks that didn't have all A's
>
> The next time I stat-ed the root of the filesystem on the client, the
> self-heal worked OK.
>
> I'm not comfortable advising you to do this as I'm really feeling my way
> here, but it looks as though it worked for me.
This seems really dangerous to me. On a brick xxx, the trusted.afr.yyy
attribute consists of three unsigned 32-bit counters, indicating how
many uncommitted operations (data, metadata, and namespace respectively)
might exist at yyy. If xxx shows uncommitted operations at yyy but not
vice versa, then we know that xxx is more up to date and it should be
the source for self-heal. If two bricks show uncommitted operations at
each other, then we're in the infamous "split brain" scenario. Some
client was unable to clear the counter at xxx while another was unable
to clear it at yyy, or both xxx and yyy went down after the operation
was complete but before they could clear the counters for each other.
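For concreteness, the value David posted decodes like this: dumping the
same attribute with '-e hex' (a standard getfattr option) makes the
three counters visible. The path is the brick from his output above:

g4:~ # getfattr -n trusted.afr.glustervol1-client-2 -e hex /mnt/glus1
# file: mnt/glus1
trusted.afr.glustervol1-client-2=0x000000000000000100000000

Read as three network-byte-order 32-bit words, that is data=0,
metadata=1, namespace=0: exactly one uncommitted metadata operation.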
In this case, it looks like a metadata operation (a permission change)
was left in this state. If the permissions are in fact the same in both
places, then
it doesn't matter which way self-heal happens, or whether it happens at
all. In fact, it seems to me that AFR should be able to detect this
particular condition and not flag it as an error. In any case, you're
probably fine here, but in general it's a very bad idea to clear these
flags manually: it can cause updates to be lost (if self-heal goes the
wrong way) or files to remain in an inconsistent state (if no self-heal
occurs).
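To verify that the permissions really do match before leaving things
alone, comparing the root's mode and ownership on each brick in the
replica set is enough. A sketch, using the brick paths from David's
output (run the same on whichever server holds the other replica):

g4:~ # stat -c '%a %U:%G' /mnt/glus1 /mnt/glus2

If every copy reports the same mode and owner, the uncommitted metadata
operation has effectively already been applied everywhere.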
The real thing I'd wonder about is why both servers are so frequently
becoming unavailable at the same instant (switch problem?) and why
permission changes on the root are apparently so frequent that this often
results in a split-brain.