[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
erik.jacobson at hpe.com
Thu Apr 2 07:02:46 UTC 2020
> Hmm, afr_inode_refresh_done() is called with error=0 and by the time we
> reach afr_txn_refresh_done(), it becomes 5(i.e. EIO).
> So afr_inode_refresh_done() is changing it to 5. Maybe you can put
> breakpoints/ log messages in afr_inode_refresh_done() at the places where
> error is getting changed and see where the assignment happens.
I had a lot of struggles tonight getting the system ready to go. I hit
segfaults (signal 11) in glusterfs (gNFS), but I think that was because
not all brick processes stopped along with glusterd, or possibly related
to my re-install and/or the added print statements; I'm not sure. I'm
not used to seeing that.
I put print statements everywhere I thought error could change and got
no printed log messages.
I put breakpoints where error would change and they were never hit.
I then set a conditional breakpoint at the call site:

break xlators/cluster/afr/src/afr-common.c:1298 if error != 0

(that line is the call: afr_txn_refresh_done(frame, this, error);)
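An alternative to breakpoints on fixed lines is a watchpoint, which stops on the instruction that writes the variable, wherever it is. A sketch using only standard gdb commands (the function and parameter names are taken from the backtrace below; line numbers are from my build):

(gdb) break afr_inode_refresh_done
(gdb) run
...
(gdb) watch error      # hardware watchpoint scoped to this frame
(gdb) continue         # gdb halts when 'error' is written, printing old and new values

This should catch the 0-to-5 transition even if it happens somewhere my print statements didn't cover.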
And it never triggered, despite split-brain messages and my debug prints
showing up in the log.
So I'm unable to explain this transition. I'm also not a gdb expert.
I still see the same back trace though.
#1  0x00007fff68938d7b in afr_txn_refresh_done (
    frame=frame@entry=0x7ffd540ed8e8, this=this@entry=0x7fff64013720, err=5,
    err@entry=0) at afr-common.c:1222
#2  0x00007fff689391f0 in afr_inode_refresh_done (
    frame=frame@entry=0x7ffd540ed8e8, this=this@entry=0x7fff64013720, error=0)
Is there other advice you might have for me to try?
I'm very eager to solve this problem, which is why I'm staying up late
to get machine time. I must go to bed now. I look forward to another
shot tomorrow night if you have more ideas to try.