[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

Tue Mar 31 09:57:59 UTC 2020

 From your reply in the other thread, I'm assuming that the file/gfid in
question is not in genuine split-brain or needing heal. For example, in
that test case with 1 brick down and 2 bricks up, if you tried to read
the file from, say, a temporary fuse mount (which is also now connected
to only 2 bricks since the 3rd one is down), it works fine and there is
no EIO error...

...which means that what you have observed is true, i.e.
afr_read_txn_refresh_done() is called with err=EIO. You can add logs to
see at what point err is set to EIO. The call graph is like this:
afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().
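
As a rough illustration of why a non-zero err matters at the last step of
that chain, here is a small standalone C model of the decision logic
visible in the afr-read-txn.c diff quoted further down. It is only a
sketch, not the actual AFR code: the function name, the struct, the enum,
and the plain int parameters standing in for priv->thin_arbiter_count,
afr_read_subvol_select_by_policy() and afr_inode_split_brain_choice_get()
are all made up for this example, and the lines that the diff does not
show are left out.

```c
#include <errno.h>

/* Hypothetical outcome codes, used only by this model. */
enum outcome { WIND_READ, QUERY_THIN_ARBITER, FAIL_READ };

struct read_result {
    enum outcome what; /* what the model decides to do */
    int read_subvol;   /* chosen subvolume index, or -1 if none */
    int err;           /* errno-style value carried to the end */
};

/*
 * Simplified model of the branches shown in the quoted diff.
 *   err                - value handed in by the refresh step
 *   thin_arbiter_count - stand-in for priv->thin_arbiter_count
 *   policy_subvol      - stand-in for the policy-based selection result
 *   spb_ret/spb_choice - stand-ins for the split-brain choice lookup
 */
struct read_result
model_read_txn_refresh_done(int err, int thin_arbiter_count,
                            int policy_subvol, int spb_ret, int spb_choice)
{
    struct read_result r = { WIND_READ, -1, err };

    if (err) {
        /* Only a thin-arbiter volume with err == EINVAL takes the
         * separate query path; every other error goes to "readfn". */
        if (thin_arbiter_count && err == EINVAL) {
            r.what = QUERY_THIN_ARBITER;
            return r;
        }
        /* read_subvol is still -1 here. */
    } else {
        r.read_subvol = policy_subvol;
        if (r.read_subvol == -1)
            r.err = EIO;
    }

    /* "readfn": fall back to an explicit split-brain choice, if any. */
    if (r.read_subvol == -1 && spb_ret == 0 && spb_choice >= 0)
        r.read_subvol = spb_choice;

    /* Still nothing to read from: this is the failing path. */
    if (r.read_subvol == -1)
        r.what = FAIL_READ;

    return r;
}
```

With err = EIO, no thin arbiter and no split-brain choice set, the model
ends in FAIL_READ with read_subvol still -1, which matches the path
marked in the quoted diff.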

Maybe 
https://github.com/gluster/glusterfs/blob/v7.4/xlators/cluster/afr/src/afr-common.c#L1188 
in afr_txn_refresh_done() is causing it either due to ret being -EIO or 
event_generation being zero.
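
To tell those two possibilities apart when adding logs, the check can be
thought of as a small classifier. The sketch below is a standalone C
model, not the code at that line: the enum and both function names are
hypothetical, and it only covers the two conditions mentioned above (ret
being -EIO, or event_generation being zero), ignoring everything else on
that path.

```c
#include <errno.h>

/* Hypothetical labels for the two suspected causes. */
enum eio_cause { CAUSE_NONE, CAUSE_RET_EIO, CAUSE_ZERO_EVENT_GEN };

/*
 * Report which of the two conditions would lead to err = EIO.
 * If both hold, the ret == -EIO case is reported first.
 */
enum eio_cause
classify_refresh_failure(int ret, int event_generation)
{
    if (ret == -EIO)
        return CAUSE_RET_EIO;
    if (event_generation == 0)
        return CAUSE_ZERO_EVENT_GEN;
    return CAUSE_NONE;
}

/* Short text for each cause, suitable for a debug log line. */
const char *
cause_name(enum eio_cause cause)
{
    switch (cause) {
    case CAUSE_RET_EIO:
        return "ret is -EIO";
    case CAUSE_ZERO_EVENT_GEN:
        return "event_generation is zero";
    default:
        return "neither condition holds";
    }
}
```

Logging the equivalent of cause_name() next to the real check would show
which of the two conditions is responsible in the failing case.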

If you are comfortable with gdb, you can put a conditional breakpoint in
afr_read_txn_refresh_done() at 
https://github.com/gluster/glusterfs/blob/v7.4/xlators/cluster/afr/src/afr-read-txn.c#L283 
when err=EIO and then check the backtrace for who is setting err to EIO.

Regards,
Ravi
On 31/03/20 12:20 pm, Erik Jacobson wrote:
> I note that this part of  afr_read_txn() gets triggered a lot.
>
>      if (afr_is_inode_refresh_reqd(inode, this, local->event_generation,
>                                    event_generation)) {
>
> Maybe that's normal when one of the three servers is down (but why
> isn't it using its local copy by default?)
>
> The comment in that if block is:
>          /* servers have disconnected / reconnected, and possibly
>             rebooted, very likely changing the state of freshness
>             of copies */
>
> But we have one server consistently down, not a changing situation.
>
> digging digging digging seemed to show this related to cache
> invalidation.... Because the paths seemed to suggest the inode needed
> refreshing and that seems handled by a case statement named
> GF_UPCALL_CACHE_INVALIDATION
>
> However, that must have been a wrong turn since turning off
> cache invalidation didn't help.
>
> I'm struggling to wrap my head around the code base and without the
> background in these concepts it's a tough hill to climb.
>
> I am going to have to try this again some day with fresh eyes and go to
> bed; the machine I have easy access to is going away in the morning.
> Now I'll have to reserve time on a contended one but I will do that and
> continue digging.
>
> Any suggestions would be greatly appreciated as I think I'm starting to
> tip over here on this one.
>
>
> On Mon, Mar 30, 2020 at 04:04:39PM -0500, Erik Jacobson wrote:
>>> Sadly I am not a  developer,  so I can't answer your questions.
>> I'm not a FS or network developer either. I think there is a joke about
>> playing one on TV but maybe it's netflix now.
>>
>> Enabling certain debug options produced too much information for me to watch
>> personally (but an expert could probably get through it).
>>
>> So I started putting targeted 'print' (gf_msg) statements in the code to
>> see how it got its way to split-brain. Maybe this will ring a bell
>> for someone.
>>
>> I can tell the only way we enter the split-brain path is through the
>> first if statement of afr_read_txn_refresh_done().
>>
>> This means afr_read_txn_refresh_done() itself was passed "err" and
>> that it appears thin_arbiter_count was not set (which makes sense,
>> I'm using 1x3, not a thin arbiter).
>>
>> So we jump to the readfn label, and read_subvol should still be -1.
>> If I read right, it must mean that this if didn't return true because
>> my print statement didn't appear:
>> if ((ret == 0) && spb_choice >= 0) {
>>
>> So we're still with the original read_subvol == -1,
>> which gets us to the split-brain message.
>>
>> So now I will try to learn why afr_read_txn_refresh_done() would have
>> 'err' set in the first place. I will also learn about
>> afr_inode_split_brain_choice_get(). Those seem to be the two methods to
>> have avoided falling into the split-brain hole here.
>>
>>
>> I put debug statements in these locations. I will mark with !!!!!! what
>> I see:
>>
>>
>>
>> diff -Narup glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c
>> --- glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c	2020-01-15 11:43:53.887894293 -0600
>> +++ glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c	2020-03-30 15:45:02.917104321 -0500
>> @@ -279,10 +279,14 @@ afr_read_txn_refresh_done(call_frame_t *
>>       priv = this->private;
>>
>>       if (err) {
>> -        if (!priv->thin_arbiter_count)
>> +        if (!priv->thin_arbiter_count) {
>> +            gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg crapola 1st if in afr_read_txn_refresh_done() !priv->thin_arbiter_count -- goto to readfn");
>> !!!!!!!!!!!!!!!!!!!!!!
>> We hit this error condition and jump to readfn below
>> !!!!!!!!!!!!!!!!!!!!!!!
>>               goto readfn;
>> -        if (err != EINVAL)
>> +        }
>> +        if (err != EINVAL) {
>> +            gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj 2nd if in afr_read_txn_refresh_done() err != EINVAL, goto readfn");
>>               goto readfn;
>> +        }
>>           /* We need to query the good bricks and/or thin-arbiter.*/
>>           afr_ta_read_txn_synctask(frame, this);
>>           return 0;
>> @@ -291,6 +295,8 @@ afr_read_txn_refresh_done(call_frame_t *
>>       read_subvol = afr_read_subvol_select_by_policy(inode, this, local->readable,
>>                                                      NULL);
>>       if (read_subvol == -1) {
>> +        gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg whoops read_subvol returned -1, going to readfn");
>> +
>>           err = EIO;
>>           goto readfn;
>>       }
>> @@ -304,11 +310,15 @@ afr_read_txn_refresh_done(call_frame_t *
>>   readfn:
>>       if (read_subvol == -1) {
>>           ret = afr_inode_split_brain_choice_get(inode, this, &spb_choice);
>> -        if ((ret == 0) && spb_choice >= 0)
>> +        if ((ret == 0) && spb_choice >= 0) {
>> !!!!!!!!!!!!!!!!!!!!!!
>> We never get here, afr_inode_split_brain_choice_get() must not have
>> returned what was needed to enter.
>> !!!!!!!!!!!!!!!!!!!!!!
>> +            gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg read_subvol was -1 to begin with split brain choice found: %d", spb_choice);
>>               read_subvol = spb_choice;
>> +        }
>>       }
>>
>>       if (read_subvol == -1) {
>> +       gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg verify this shows up above split-brain error");
>> !!!!!!!!!!!!!!!!!!!!!!
>> We hit here. Game over player.
>> !!!!!!!!!!!!!!!!!!!!!!
>> +
>>           AFR_SET_ERROR_AND_CHECK_SPLIT_BRAIN(-1, err);
>>       }
>>       afr_read_txn_wind(frame, this, read_subvol);
>