[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

Erik Jacobson erik.jacobson at hpe.com
Wed Apr 15 15:39:34 UTC 2020


After several successful runs of the test case, we thought the problem was
solved. Indeed, the split-brain is gone.

But we're triggering a seg fault now, even in a less loaded case.

We're going to switch to gluster74, as you intended, and report
back.

On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote:
> > Attached the wrong patch by mistake in my previous mail. Sending the correct
> > one now.
> 
> Early results look GREAT!!
> 
> We'll keep beating on it. We applied it to gluster72, as that is what we
> have to ship with. It applied cleanly with some line moves.
> 
> If you would like us to also run a test with gluster74 so that you can
> say that's tested, we can run that test. I can do a special build.
> 
> THANK YOU!!
> 
> > 
> > 
> > -Ravi
> > 
> > 
> > On 15/04/20 2:05 pm, Ravishankar N wrote:
> > 
> > 
> >     On 10/04/20 2:06 am, Erik Jacobson wrote:
> > 
> >         Once again thanks for sticking with us. Here is a reply from Scott
> >         Titus. If you have something for us to try, we'd love it. The code had
> >         your patch applied when gdb was run:
> > 
> > 
> >         Here is the addr2line output for those addresses. Very interesting
> >         command, of which I was not aware.
> > 
> >         [root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f735
> >         afr_lookup_metadata_heal_check
> >         afr-common.c:2803
> >         [root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f0b9
> >         afr_lookup_done
> >         afr-common.c:2455
> >         [root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x5c701
> >         afr_inode_event_gen_reset
> >         afr-common.c:755
> > 
> > 
> >     Right, so afr_lookup_done() is resetting the event gen to zero. This looks
> >     like a race between lookup and inode refresh code paths. We made some
> >     changes to the event generation logic in AFR. Can you apply the attached
> >     patch and see if it fixes the split-brain issue? It should apply cleanly on
> >     glusterfs-7.4.
> > 
> >     Thanks,
> >     Ravi
> > 
> >    
> >     ________
> > 
> > 
> > 
> >     Community Meeting Calendar:
> > 
> >     Schedule -
> >     Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> >     Bridge: https://bluejeans.com/441850968
> > 
> >     Gluster-users mailing list
> >     Gluster-users at gluster.org
> >     https://lists.gluster.org/mailman/listinfo/gluster-users
> > 
> 
> > >From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001
> > From: Ravishankar N <ravishankar at redhat.com>
> > Date: Wed, 15 Apr 2020 13:53:26 +0530
> > Subject: [PATCH] afr: event gen changes
> > 
> > The general idea of the changes is to prevent resetting event generation
> > to zero in the inode ctx, since event gen is something that should
> > follow 'causal order'.
> > 
> > Change #1:
> > For a read txn, in inode refresh cbk, if event_generation is
> > found zero, we are failing the read fop. This is not needed
> > because change in event gen is only a marker for the next inode refresh to
> > happen and should not be taken into account by the current read txn.
> > 
> > Change #2:
> > The event gen being zero above can happen if there is a racing lookup,
> > which resets event gen (in afr_lookup_done) if there are non-zero afr
> > xattrs. The resetting is done only to trigger an inode refresh and a
> > possible client side heal on the next lookup. That can be achieved by
> > setting the need_refresh flag in the inode ctx. So replaced all
> > occurrences of resetting event gen to zero with a call to
> > afr_inode_need_refresh_set().
> > 
> > Change #3:
> > In both lookup and discover path, we are doing an inode refresh which is
> > not required since all 3 essentially do the same thing: update the inode
> > ctx with the good/bad copies from the brick replies. Inode refresh also
> > triggers background heals, but I think it is okay to do it when we call
> > refresh during the read and write txns and not in the lookup path.
> > 
> > Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1
> > Signed-off-by: Ravishankar N <ravishankar at redhat.com>
> > ---
> >  ...ismatch-resolution-with-fav-child-policy.t |  8 +-
> >  xlators/cluster/afr/src/afr-common.c          | 92 ++++---------------
> >  xlators/cluster/afr/src/afr-dir-write.c       |  6 +-
> >  xlators/cluster/afr/src/afr.h                 |  5 +-
> >  4 files changed, 29 insertions(+), 82 deletions(-)
> > 
> > diff --git a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> > index f4aa351e4..12af0c854 100644
> > --- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> > +++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> > @@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ]
> >  #We know that second brick has the bigger size file
> >  BIGGER_FILE_MD5=$(md5sum $B0/${V0}1/f3 | cut -d\  -f1)
> >  
> > -TEST ls $M0/f3
> > -TEST cat $M0/f3
> > +TEST ls $M0 #Trigger entry heal via readdir inode refresh
> > +TEST cat $M0/f3 #Trigger data heal via readv inode refresh
> >  EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
> >  
> >  #gfid split-brain should be resolved
> > @@ -215,8 +215,8 @@ TEST $CLI volume start $V0 force
> >  EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}2
> >  EXPECT_WITHIN $CHILD_UP_TIMEOUT "1" afr_child_up_status $V0 2
> >  
> > -TEST ls $M0/f4
> > -TEST cat $M0/f4
> > +TEST ls $M0 #Trigger entry heal via readdir inode refresh
> > +TEST cat $M0/f4  #Trigger data heal via readv inode refresh
> >  EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
> >  
> >  #gfid split-brain should be resolved
> > diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
> > index 61f21795e..319665a14 100644
> > --- a/xlators/cluster/afr/src/afr-common.c
> > +++ b/xlators/cluster/afr/src/afr-common.c
> > @@ -282,7 +282,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
> >                  metadatamap |= (1 << index);
> >              }
> >              if (metadatamap_old != metadatamap) {
> > -                event = 0;
> > +                __afr_inode_need_refresh_set(inode, this);
> >              }
> >              break;
> >  
> > @@ -295,7 +295,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
> >                  datamap |= (1 << index);
> >              }
> >              if (datamap_old != datamap)
> > -                event = 0;
> > +                __afr_inode_need_refresh_set(inode, this);
> >              break;
> >  
> >          default:
> > @@ -458,34 +458,6 @@ out:
> >      return ret;
> >  }
> >  
> > -int
> > -__afr_inode_event_gen_reset_small(inode_t *inode, xlator_t *this)
> > -{
> > -    int ret = -1;
> > -    uint16_t datamap = 0;
> > -    uint16_t metadatamap = 0;
> > -    uint32_t event = 0;
> > -    uint64_t val = 0;
> > -    afr_inode_ctx_t *ctx = NULL;
> > -
> > -    ret = __afr_inode_ctx_get(this, inode, &ctx);
> > -    if (ret)
> > -        return ret;
> > -
> > -    val = ctx->read_subvol;
> > -
> > -    metadatamap = (val & 0x000000000000ffff) >> 0;
> > -    datamap = (val & 0x00000000ffff0000) >> 16;
> > -    event = 0;
> > -
> > -    val = ((uint64_t)metadatamap) | (((uint64_t)datamap) << 16) |
> > -          (((uint64_t)event) << 32);
> > -
> > -    ctx->read_subvol = val;
> > -
> > -    return ret;
> > -}
> > -
> >  int
> >  __afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
> >                              unsigned char *metadata, int *event_p)
> > @@ -556,22 +528,6 @@ out:
> >      return ret;
> >  }
> >  
> > -int
> > -__afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
> > -{
> > -    afr_private_t *priv = NULL;
> > -    int ret = -1;
> > -
> > -    priv = this->private;
> > -
> > -    if (priv->child_count <= 16)
> > -        ret = __afr_inode_event_gen_reset_small(inode, this);
> > -    else
> > -        ret = -1;
> > -
> > -    return ret;
> > -}
> > -
> >  int
> >  afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
> >                            unsigned char *metadata, int *event_p)
> > @@ -721,30 +677,22 @@ out:
> >      return need_refresh;
> >  }
> >  
> > -static int
> > -afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
> > +int
> > +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
> >  {
> >      int ret = -1;
> >      afr_inode_ctx_t *ctx = NULL;
> >  
> > -    GF_VALIDATE_OR_GOTO(this->name, inode, out);
> > -
> > -    LOCK(&inode->lock);
> > -    {
> > -        ret = __afr_inode_ctx_get(this, inode, &ctx);
> > -        if (ret)
> > -            goto unlock;
> > -
> > +    ret = __afr_inode_ctx_get(this, inode, &ctx);
> > +    if (ret == 0) {
> >          ctx->need_refresh = _gf_true;
> >      }
> > -unlock:
> > -    UNLOCK(&inode->lock);
> > -out:
> > +
> >      return ret;
> >  }
> >  
> >  int
> > -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
> > +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
> >  {
> >      int ret = -1;
> >  
> > @@ -754,7 +702,7 @@ afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
> >                       "Resetting event gen for %s", uuid_utoa(inode->gfid));
> >      LOCK(&inode->lock);
> >      {
> > -        ret = __afr_inode_event_gen_reset(inode, this);
> > +        ret = __afr_inode_need_refresh_set(inode, this);
> >      }
> >      UNLOCK(&inode->lock);
> >  out:
> > @@ -1187,7 +1135,7 @@ afr_txn_refresh_done(call_frame_t *frame, xlator_t *this, int err)
> >      ret = afr_inode_get_readable(frame, inode, this, local->readable,
> >                                   &event_generation, local->transaction.type);
> >  
> > -    if (ret == -EIO || (local->is_read_txn && !event_generation)) {
> > +    if (ret == -EIO) {
> >          /* No readable subvolume even after refresh ==> splitbrain.*/
> >          if (!priv->fav_child_policy) {
> >              err = EIO;
> > @@ -2451,7 +2399,7 @@ afr_lookup_done(call_frame_t *frame, xlator_t *this)
> >          if (read_subvol == -1)
> >              goto cant_interpret;
> >          if (ret) {
> > -            afr_inode_event_gen_reset(local->inode, this);
> > +            afr_inode_need_refresh_set(local->inode, this);
> >              dict_del_sizen(local->replies[read_subvol].xdata, GF_CONTENT_KEY);
> >          }
> >      } else {
> > @@ -3007,6 +2955,7 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this)
> >      afr_private_t *priv = NULL;
> >      afr_local_t *local = NULL;
> >      int read_subvol = -1;
> > +    int ret = 0;
> >      unsigned char *data_readable = NULL;
> >      unsigned char *success_replies = NULL;
> >  
> > @@ -3028,7 +2977,10 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this)
> >      if (!afr_has_quorum(success_replies, this, frame))
> >          goto unwind;
> >  
> > -    afr_replies_interpret(frame, this, local->inode, NULL);
> > +    ret = afr_replies_interpret(frame, this, local->inode, NULL);
> > +    if (ret) {
> > +        afr_inode_need_refresh_set(local->inode, this);
> > +    }
> >  
> >      read_subvol = afr_read_subvol_decide(local->inode, this, NULL,
> >                                           data_readable);
> > @@ -3284,11 +3236,7 @@ afr_discover(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req)
> >      afr_read_subvol_get(loc->inode, this, NULL, NULL, &event,
> >                          AFR_DATA_TRANSACTION, NULL);
> >  
> > -    if (afr_is_inode_refresh_reqd(loc->inode, this, event,
> > -                                  local->event_generation))
> > -        afr_inode_refresh(frame, this, loc->inode, NULL, afr_discover_do);
> > -    else
> > -        afr_discover_do(frame, this, 0);
> > +    afr_discover_do(frame, this, 0);
> >  
> >      return 0;
> >  out:
> > @@ -3429,11 +3377,7 @@ afr_lookup(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req)
> >      afr_read_subvol_get(loc->parent, this, NULL, NULL, &event,
> >                          AFR_DATA_TRANSACTION, NULL);
> >  
> > -    if (afr_is_inode_refresh_reqd(loc->inode, this, event,
> > -                                  local->event_generation))
> > -        afr_inode_refresh(frame, this, loc->parent, NULL, afr_lookup_do);
> > -    else
> > -        afr_lookup_do(frame, this, 0);
> > +    afr_lookup_do(frame, this, 0);
> >  
> >      return 0;
> >  out:
> > diff --git a/xlators/cluster/afr/src/afr-dir-write.c b/xlators/cluster/afr/src/afr-dir-write.c
> > index 82a72fddd..333085b14 100644
> > --- a/xlators/cluster/afr/src/afr-dir-write.c
> > +++ b/xlators/cluster/afr/src/afr-dir-write.c
> > @@ -119,11 +119,11 @@ __afr_dir_write_finalize(call_frame_t *frame, xlator_t *this)
> >              continue;
> >          if (local->replies[i].op_ret < 0) {
> >              if (local->inode)
> > -                afr_inode_event_gen_reset(local->inode, this);
> > +                afr_inode_need_refresh_set(local->inode, this);
> >              if (local->parent)
> > -                afr_inode_event_gen_reset(local->parent, this);
> > +                afr_inode_need_refresh_set(local->parent, this);
> >              if (local->parent2)
> > -                afr_inode_event_gen_reset(local->parent2, this);
> > +                afr_inode_need_refresh_set(local->parent2, this);
> >              continue;
> >          }
> >  
> > diff --git a/xlators/cluster/afr/src/afr.h b/xlators/cluster/afr/src/afr.h
> > index a3f2942b3..ed6d777c1 100644
> > --- a/xlators/cluster/afr/src/afr.h
> > +++ b/xlators/cluster/afr/src/afr.h
> > @@ -958,7 +958,10 @@ afr_inode_read_subvol_set(inode_t *inode, xlator_t *this,
> >                            int event_generation);
> >  
> >  int
> > -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this);
> > +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this);
> > +
> > +int
> > +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this);
> >  
> >  int
> >  afr_read_subvol_select_by_policy(inode_t *inode, xlator_t *this,
> > -- 
> > 2.25.2
> > 


Erik Jacobson
Software Engineer

erik.jacobson at hpe.com
+1 612 851 0550 Office

Eagan, MN
hpe.com
