[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

Erik Jacobson erik.jacobson at hpe.com
Mon Mar 30 13:53:35 UTC 2020

Thank you so much for replying --

> > [2020-03-29 03:42:52.295532] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. [Input/output error]

> Since you say that the errors go away when all 3 bricks (which I guess is
> what you refer to as 'leaders') of the replica are up, it could be possible

Yes, leaders == gluster+gnfs servers for this. We use 'leader' internally
to mean servers that help manage compute nodes. I try to convert it to
'server' in my writing, but 'leader' slips out sometimes.

> that the brick you brought down had the only good copy. In such cases, even
> though you have the other 2 bricks of the replica up, they both are bad

I think all 3 copies are good. That is because the same exact files are
accessed the same way when nodes boot. With one server down, 76 nodes
normally boot with no errors. Once in a while one fails with split-brain
errors in the log. The more load I put on, the more likely a split-brain
when one server is down. So that's why my test case is so weird looking.
It has to generate a bunch of extra load and then try to access root
filesystem files using our tools to trigger the split-brain. The test
is good in that it produces at least a couple split-brain errors every
time. I'm actually very happy to have a test case. We've been dealing
with reports for some time.

The healing entries seen are explained by the writable XFS image files in
gluster -- one per node -- that the nodes use for their /etc, /var, and
so on. So the 76 healing messages were expected. If it would help to
reduce confusion, I can repeat the test using tmpfs for the
writable areas so that the healing list is clear.
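To rule out a genuinely bad copy, here is roughly what I can run on the servers. This is a sketch: the volume name cm_shared comes from the log message above, and the brick path below is a placeholder for our actual brick directory.

```shell
# List files the self-heal daemon still considers pending (expected to be
# the per-node writable XFS images), plus any confirmed split-brain entries.
gluster volume heal cm_shared info
gluster volume heal cm_shared info split-brain

# On each server, inspect the AFR changelog xattrs for a suspect file.
# /data/brick is a placeholder for the real brick path. All-zero
# trusted.afr.* values on every brick mean no pending operations,
# i.e. no real split-brain on disk.
getfattr -d -m . -e hex /data/brick/path/to/file
```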

> copies waiting to be healed and hence all operations on those files will
> fail with EIO. Since you say this occurs under high load only. I suspect

To be clear, with one server down, operations work like 99.9% of the time.
Same operations on every node. It's only when we bring the load up
(maybe heavy metadata activity?) that we get split-brain errors with one
server down.

It is a strange problem but I don't believe there is a problem with any
copy of any file. Never say never and nothing would make me happier than
being wrong and solving the problem.

I want to thank you so much for writing back. I'm willing to try any
suggestions we come up with.

