[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

Ravishankar N ravishankar at redhat.com
Mon Mar 30 04:22:44 UTC 2020


On 29/03/20 9:40 am, Erik Jacobson wrote:
> Hello all,
>
> I am getting split-brain errors in the gnfs nfs.log when 1 gluster
> server is down in a 3-brick/3-node gluster volume. It only happens under
> intense load.
>
> In the lab, I have a test case that can repeat the problem on a single
> subvolume cluster.
>
>   If all leaders are up, we see no errors.
>
>
> Here are example nfs.log errors:
>
>
> [2020-03-29 03:42:52.295532] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. [Input/output error]
>
Since you say the errors go away when all 3 bricks (which I guess is
what you refer to as 'leaders') of the replica are up, it is possible
that the brick you brought down had the only good copy. In that case,
even though the other 2 bricks of the replica are up, they are both
bad copies waiting to be healed, so all operations on those files will
fail with EIO. Since you say this occurs only under high load, I
suspect this is what is happening: self-heal hasn't had time to catch
up with the nodes going up and down.
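One way to confirm is to check whether the files show up as needing
heal (and whether any are actually flagged as split-brain) while the
node is down. A rough sketch, using the volume name cm_shared from
your log; the brick path is just a placeholder:

  # On any gluster server, list files with pending heals:
  gluster volume heal cm_shared info

  # List only the entries gluster itself considers split-brain:
  gluster volume heal cm_shared info split-brain

  # On a brick, inspect the AFR pending xattrs for one file;
  # non-zero trusted.afr.* values mean heals are pending
  # (/data/brick/cm_shared is a placeholder brick path):
  getfattr -d -m . -e hex /data/brick/cm_shared/path/to/file

If the first command shows a large backlog while the node is down,
that supports the pending-heal theory rather than a real split-brain.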

If you see the split-brain errors despite all 3 replica bricks being 
online and the gnfs server being able to connect to all of them, then it 
could be a genuine split-brain problem. But I don't think that is the 
case here.
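If it does turn out to be genuine, the CLI can resolve split-brain per
file once you pick a source; a sketch (FILE is the path relative to
the volume root, and HOSTNAME:BRICKPATH is whatever "gluster volume
info cm_shared" lists):

  # Use the copy with the latest modification time as the source:
  gluster volume heal cm_shared split-brain latest-mtime FILE

  # Or explicitly pick the copy on one brick as the source:
  gluster volume heal cm_shared split-brain source-brick HOSTNAME:BRICKPATH FILE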

Regards,
Ravi


