[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

Erik Jacobson erik.jacobson at hpe.com
Tue Mar 31 06:50:11 UTC 2020


I note that this part of afr_read_txn() gets triggered a lot.

    if (afr_is_inode_refresh_reqd(inode, this, local->event_generation,
                                  event_generation)) {

Maybe that's normal when one of the three servers is down (but why
isn't it using its local copy by default?)

The comment in that if block is:
        /* servers have disconnected / reconnected, and possibly
           rebooted, very likely changing the state of freshness
           of copies */

But we have one server consistently down, not a changing situation.
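
To make the question concrete, here is a toy model of what a check like
this might do. It is a sketch under assumptions, not the glusterfs code:
it assumes (I have not checked this against the 7.2 source) that a
refresh is requested either when the event generation recorded by the
transaction differs from the current one, or when a per-inode
"needs refresh" flag was set by something else. The struct and function
names (toy_inode_ctx, toy_refresh_required) are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a per-inode context; not the real AFR type. */
struct toy_inode_ctx {
    bool need_refresh; /* assumed: set when some event asks for a refresh */
};

/*
 * Toy check: report that a refresh is needed if the flag was set, or if
 * the generation the transaction saw differs from the current one.
 * The flag is cleared on read so one request causes only one refresh.
 */
bool
toy_refresh_required(struct toy_inode_ctx *ctx, int txn_event_gen,
                     int cur_event_gen)
{
    bool flagged = ctx->need_refresh;

    ctx->need_refresh = false;
    return flagged || (txn_event_gen != cur_event_gen);
}
```

Under this model, a server that stays down should not keep changing the
generation, so toy_refresh_required(&ctx, 5, 5) is false unless the flag
is set. If the real check behaves similarly, a refresh that fires on
every read would point at the flag being set repeatedly or at a stale
recorded generation rather than at the disconnect itself.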

More digging seemed to show this is related to cache invalidation,
because the code paths suggested the inode needed refreshing, and that
appears to be handled by a case statement named
GF_UPCALL_CACHE_INVALIDATION.

However, that must have been a wrong turn since turning off
cache invalidation didn't help.

I'm struggling to wrap my head around the code base and without the
background in these concepts it's a tough hill to climb.

I am going to have to try this again some day with fresh eyes and go to
bed; the machine I have easy access to is going away in the morning.
Now I'll have to reserve time on a contended one but I will do that and
continue digging.

Any suggestions would be greatly appreciated as I think I'm starting to
tip over here on this one.


On Mon, Mar 30, 2020 at 04:04:39PM -0500, Erik Jacobson wrote:
> > Sadly I am not a  developer,  so I can't answer your questions.
> 
> I'm not a FS or network developer either. I think there is a joke about
> playing one on TV but maybe it's Netflix now.
> 
> Enabling certain debug options produced too much information for me to
> follow personally (but an expert could probably get through it).
> 
> So I started putting targeted 'print' (gf_msg) statements in the code to
> see how it got its way to split-brain. Maybe this will ring a bell
> for someone.
> 
> I can tell the only way we enter the split-brain path is through the
> first if statement of afr_read_txn_refresh_done().
> 
> This means afr_read_txn_refresh_done() itself was passed "err" and
> that it appears thin_arbiter_count was not set (which makes sense,
> I'm using 1x3, not a thin arbiter).
> 
> So we jump to the readfn label, and read_subvol should still be -1.
> If I read it right, this if must not have evaluated to true, because
> my print statement didn't appear:
> if ((ret == 0) && spb_choice >= 0) {
> 
> So we're still with the original read_subvol == -1,
> which gets us to the split-brain message.
> 
> So now I will try to learn why afr_read_txn_refresh_done() would have
> 'err' set in the first place. I will also learn about
> afr_inode_split_brain_choice_get(). Those seem to be the two ways to
> avoid falling into the split-brain hole here.
> 
> 
> I put debug statements in these locations. I will mark with !!!!!! what
> I see:
> 
> 
> 
> diff -Narup glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c
> --- glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c	2020-01-15 11:43:53.887894293 -0600
> +++ glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c	2020-03-30 15:45:02.917104321 -0500
> @@ -279,10 +279,14 @@ afr_read_txn_refresh_done(call_frame_t *
>      priv = this->private;
> 
>      if (err) {
> -        if (!priv->thin_arbiter_count)
> +        if (!priv->thin_arbiter_count) {
> +            gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg crapola 1st if in afr_read_txn_refresh_done() !priv->thin_arbiter_count -- goto to readfn");
> !!!!!!!!!!!!!!!!!!!!!!
> We hit this error condition and jump to readfn below
> !!!!!!!!!!!!!!!!!!!!!!!
>              goto readfn;
> -        if (err != EINVAL)
> +        }
> +        if (err != EINVAL) {
> +            gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj 2nd if in afr_read_txn_refresh_done() err != EINVAL, goto readfn");
>              goto readfn;
> +        }
>          /* We need to query the good bricks and/or thin-arbiter.*/
>          afr_ta_read_txn_synctask(frame, this);
>          return 0;
> @@ -291,6 +295,8 @@ afr_read_txn_refresh_done(call_frame_t *
>      read_subvol = afr_read_subvol_select_by_policy(inode, this, local->readable,
>                                                     NULL);
>      if (read_subvol == -1) {
> +        gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg whoops read_subvol returned -1, going to readfn");
> +
>          err = EIO;
>          goto readfn;
>      }
> @@ -304,11 +310,15 @@ afr_read_txn_refresh_done(call_frame_t *
>  readfn:
>      if (read_subvol == -1) {
>          ret = afr_inode_split_brain_choice_get(inode, this, &spb_choice);
> -        if ((ret == 0) && spb_choice >= 0)
> +        if ((ret == 0) && spb_choice >= 0) {
> !!!!!!!!!!!!!!!!!!!!!!
> We never get here, afr_inode_split_brain_choice_get() must not have
> returned what was needed to enter.
> !!!!!!!!!!!!!!!!!!!!!!
> +            gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg read_subvol was -1 to begin with split brain choice found: %d", spb_choice);
>              read_subvol = spb_choice;
> +        }
>      }
> 
>      if (read_subvol == -1) {
> +       gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg verify this shows up above split-brain error");
> !!!!!!!!!!!!!!!!!!!!!!
> We hit here. Game over player.
> !!!!!!!!!!!!!!!!!!!!!!
> +
>          AFR_SET_ERROR_AND_CHECK_SPLIT_BRAIN(-1, err);
>      }
>      afr_read_txn_wind(frame, this, read_subvol);
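
The branch structure visible in the diff can be modelled in a few lines
of standalone C. This is only a sketch of the lines shown in the two
hunks above: the code between the hunks is not in the post and is not
modelled, and every name prefixed toy_ is hypothetical. The real calls
to afr_read_subvol_select_by_policy() and
afr_inode_split_brain_choice_get() are replaced by plain parameters so
the paths can be exercised by hand.

```c
#include <errno.h>
#include <stdbool.h>

struct toy_result {
    int read_subvol; /* -1 means no readable copy was chosen */
    int err;         /* errno-style value carried to the error path */
    bool queried_ta; /* true if the thin-arbiter branch was taken */
};

/*
 * policy_subvol: stand-in for what the select-by-policy call returns.
 * spb_ret / spb_choice: stand-ins for the split-brain-choice lookup.
 */
struct toy_result
toy_refresh_done(int err, int thin_arbiter_count, int policy_subvol,
                 int spb_ret, int spb_choice)
{
    struct toy_result r = { .read_subvol = -1, .err = err,
                            .queried_ta = false };

    if (err) {
        if (!thin_arbiter_count)
            goto readfn; /* the path reported as hit in the post */
        if (err != EINVAL)
            goto readfn;
        r.queried_ta = true; /* real code queries the thin arbiter here */
        return r;
    }

    r.read_subvol = policy_subvol;
    if (r.read_subvol == -1) {
        r.err = EIO;
        goto readfn;
    }
    /* lines between the two diff hunks are not shown and not modelled */

readfn:
    if (r.read_subvol == -1) {
        if ((spb_ret == 0) && spb_choice >= 0)
            r.read_subvol = spb_choice;
    }
    /* read_subvol still -1 here corresponds to the split-brain error */
    return r;
}
```

In this model, any nonzero err with thin_arbiter_count == 0 leaves
read_subvol at -1 unless a split-brain choice is available, which
matches the sequence of debug messages described above. It says nothing
about why err is set in the first place.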



