[Gluster-users] self heal errors on 3.1.1 clients

David Lloyd david.lloyd at v-consultants.co.uk
Thu Jan 27 22:03:17 UTC 2011


Yes, it seemed really dangerous to me too. But given the lack of
documentation and the lack of response from Gluster (and since the data is
still on the old system too), I thought I'd give it a shot.

Thanks for the explanation. The split-brain problem seems to come up fairly
regularly, but I've not found any clear explanation of what to do in this
situation. I'm starting to worry about what appears to be a rationing of
information from gluster.com to the community at large.

We're not in a position to purchase support, and I'm a sysadmin, not a
developer. I hope to contribute through testing, feedback, and bug
reports, but I'm seeing a lot of threads that seem to go nowhere,
and it's getting a bit frustrating.

David



> This seems really dangerous to me.  On a brick xxx, the trusted.afr.yyy
> attribute consists of three unsigned 32-bit counters, indicating how many
> uncommitted operations (data, metadata, and namespace respectively) might
> exist at yyy.  If xxx shows uncommitted operations at yyy but not vice
> versa, then we know that xxx is more up to date and it should be the source
> for self-heal.  If two bricks show uncommitted operations at each other,
> then we're in the infamous "split brain" scenario.  Some client was unable
> to clear the counter at xxx while another was unable to clear it at yyy, or
> both xxx and yyy went down after the operation was complete but before they
> could clear the counters for each other.
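>
> (To make that concrete: here's a minimal Python sketch of how you could
> decode one of those values as dumped by "getfattr -e hex". It assumes
> exactly the layout described above, with the three counters packed
> big-endian; the helper name is just for illustration.)
>
>     import struct
>
>     def decode_afr(hex_value):
>         # e.g. hex_value = "0x000000000000000100000000", as shown by
>         #   getfattr -d -m trusted.afr -e hex <file-on-brick>
>         if hex_value.startswith("0x"):
>             hex_value = hex_value[2:]
>         raw = bytes.fromhex(hex_value)
>         # three unsigned 32-bit counters: data, metadata, namespace
>         data, metadata, namespace = struct.unpack(">III", raw[:12])
>         return {"data": data, "metadata": metadata, "namespace": namespace}
>
>     print(decode_afr("0x000000000000000100000000"))
>     # {'data': 0, 'metadata': 1, 'namespace': 0} -> one pending metadata op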
>
> In this case, it looks like a metadata operation (permission change) was in
> this state.  If the permissions are in fact the same both places then it
> doesn't matter which way self-heal happens, or whether it happens at all.
>  In fact, it seems to me that AFR should be able to detect this particular
> condition and not flag it as an error.  In any case, I think you're probably
> fine in this case but in general it's a very bad idea to clear these flags
> manually because it can cause updates to be lost (if self-heal goes the
> wrong way) or files to remain in an inconsistent state (if no self-heal
> occurs).
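>
> (Again purely as a sketch: the decision rule above, expressed in terms
> of the hypothetical decode_afr() helper from earlier. Real AFR weighs
> the data/metadata/namespace counters individually, so this collapses
> some detail.)
>
>     def classify(xxx_blames_yyy, yyy_blames_xxx):
>         # Each argument: the decoded counter dict one brick holds about
>         # the other, e.g. {"data": 0, "metadata": 1, "namespace": 0}
>         xxx_pending = any(v > 0 for v in xxx_blames_yyy.values())
>         yyy_pending = any(v > 0 for v in yyy_blames_xxx.values())
>         if xxx_pending and yyy_pending:
>             return "split-brain: each brick blames the other"
>         if xxx_pending:
>             return "xxx is source: self-heal xxx -> yyy"
>         if yyy_pending:
>             return "yyy is source: self-heal yyy -> xxx"
>         return "clean: no self-heal needed"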
>
> The real thing I'd wonder about is why both servers are so frequently
> becoming unavailable at the same instant (switch problem?) and why
> permission changes on the root are apparently so frequent that this often
> results in a split-brain.
>



-- 
David Lloyd
V Consultants
www.v-consultants.co.uk
tel: +44 7983 816501
skype: davidlloyd1243

