[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

Erik Jacobson erik.jacobson at hpe.com
Wed Apr 15 15:33:01 UTC 2020


> Attached the wrong patch by mistake in my previous mail. Sending the correct
> one now.

Early results loook GREAT !!

We'll keep beating on it. We applied it to glsuter72 as that is what we
have to ship with. It applied fine with some line moves.

If you would like us to also run a test with gluster74 so that you can
say that's tested, we can run that test. I can do a special build.

THANK YOU!!

> 
> 
> -Ravi
> 
> 
> On 15/04/20 2:05 pm, Ravishankar N wrote:
> 
> 
>     On 10/04/20 2:06 am, Erik Jacobson wrote:
> 
>         Once again thanks for sticking with us. Here is a reply from Scott
>         Titus. If you have something for us to try, we'd love it. The code had
>         your patch applied when gdb was run:
> 
> 
>         Here is the addr2line output for those addresses.  Very interesting
>         command, of
>         which I was not aware.
> 
>         [root at leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
>         cluster/
>         afr.so 0x6f735
>         afr_lookup_metadata_heal_check
>         afr-common.c:2803
>         [root at leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
>         cluster/
>         afr.so 0x6f0b9
>         afr_lookup_done
>         afr-common.c:2455
>         [root at leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
>         cluster/
>         afr.so 0x5c701
>         afr_inode_event_gen_reset
>         afr-common.c:755
> 
> 
>     Right, so afr_lookup_done() is resetting the event gen to zero. This looks
>     like a race between lookup and inode refresh code paths. We made some
>     changes to the event generation logic in AFR. Can you apply the attached
>     patch and see if it fixes the split-brain issue? It should apply cleanly on
>     glusterfs-7.4.
> 
>     Thanks,
>     Ravi
> 
>    
>     ________
> 
> 
> 
>     Community Meeting Calendar:
> 
>     Schedule -
>     Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>     Bridge: https://bluejeans.com/441850968
> 
>     Gluster-users mailing list
>     Gluster-users at gluster.org
>     https://lists.gluster.org/mailman/listinfo/gluster-users
> 

> >From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001
> From: Ravishankar N <ravishankar at redhat.com>
> Date: Wed, 15 Apr 2020 13:53:26 +0530
> Subject: [PATCH] afr: event gen changes
> 
> The general idea of the changes is to prevent resetting event generation
> to zero in the inode ctx, since event gen is something that should
> follow 'causal order'.
> 
> Change #1:
> For a read txn, in inode refresh cbk, if event_generation is
> found zero, we are failing the read fop. This is not needed
> because change in event gen is only a marker for the next inode refresh to
> happen and should not be taken into account by the current read txn.
> 
> Change #2:
> The event gen being zero above can happen if there is a racing lookup,
> which resets even get (in afr_lookup_done) if there are non zero afr
> xattrs. The resetting is done only to trigger an inode refresh and a
> possible client side heal on the next lookup. That can be acheived by
> setting the need_refresh flag in the inode ctx. So replaced all
> occurences of resetting even gen to zero with a call to
> afr_inode_need_refresh_set().
> 
> Change #3:
> In both lookup and discover path, we are doing an inode refresh which is
> not required since all 3 essentially do the same thing- update the inode
> ctx with the good/bad copies from the brick replies. Inode refresh also
> triggers background heals, but I think it is okay to do it when we call
> refresh during the read and write txns and not in the lookup path.
> 
> Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1
> Signed-off-by: Ravishankar N <ravishankar at redhat.com>
> ---
>  ...ismatch-resolution-with-fav-child-policy.t |  8 +-
>  xlators/cluster/afr/src/afr-common.c          | 92 ++++---------------
>  xlators/cluster/afr/src/afr-dir-write.c       |  6 +-
>  xlators/cluster/afr/src/afr.h                 |  5 +-
>  4 files changed, 29 insertions(+), 82 deletions(-)
> 
> diff --git a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> index f4aa351e4..12af0c854 100644
> --- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> +++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> @@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ]
>  #We know that second brick has the bigger size file
>  BIGGER_FILE_MD5=$(md5sum $B0/${V0}1/f3 | cut -d\  -f1)
>  
> -TEST ls $M0/f3
> -TEST cat $M0/f3
> +TEST ls $M0 #Trigger entry heal via readdir inode refresh
> +TEST cat $M0/f3 #Trigger data heal via readv inode refresh
>  EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
>  
>  #gfid split-brain should be resolved
> @@ -215,8 +215,8 @@ TEST $CLI volume start $V0 force
>  EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}2
>  EXPECT_WITHIN $CHILD_UP_TIMEOUT "1" afr_child_up_status $V0 2
>  
> -TEST ls $M0/f4
> -TEST cat $M0/f4
> +TEST ls $M0 #Trigger entry heal via readdir inode refresh
> +TEST cat $M0/f4  #Trigger data heal via readv inode refresh
>  EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
>  
>  #gfid split-brain should be resolved
> diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
> index 61f21795e..319665a14 100644
> --- a/xlators/cluster/afr/src/afr-common.c
> +++ b/xlators/cluster/afr/src/afr-common.c
> @@ -282,7 +282,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
>                  metadatamap |= (1 << index);
>              }
>              if (metadatamap_old != metadatamap) {
> -                event = 0;
> +                __afr_inode_need_refresh_set(inode, this);
>              }
>              break;
>  
> @@ -295,7 +295,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
>                  datamap |= (1 << index);
>              }
>              if (datamap_old != datamap)
> -                event = 0;
> +                __afr_inode_need_refresh_set(inode, this);
>              break;
>  
>          default:
> @@ -458,34 +458,6 @@ out:
>      return ret;
>  }
>  
> -int
> -__afr_inode_event_gen_reset_small(inode_t *inode, xlator_t *this)
> -{
> -    int ret = -1;
> -    uint16_t datamap = 0;
> -    uint16_t metadatamap = 0;
> -    uint32_t event = 0;
> -    uint64_t val = 0;
> -    afr_inode_ctx_t *ctx = NULL;
> -
> -    ret = __afr_inode_ctx_get(this, inode, &ctx);
> -    if (ret)
> -        return ret;
> -
> -    val = ctx->read_subvol;
> -
> -    metadatamap = (val & 0x000000000000ffff) >> 0;
> -    datamap = (val & 0x00000000ffff0000) >> 16;
> -    event = 0;
> -
> -    val = ((uint64_t)metadatamap) | (((uint64_t)datamap) << 16) |
> -          (((uint64_t)event) << 32);
> -
> -    ctx->read_subvol = val;
> -
> -    return ret;
> -}
> -
>  int
>  __afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
>                              unsigned char *metadata, int *event_p)
> @@ -556,22 +528,6 @@ out:
>      return ret;
>  }
>  
> -int
> -__afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
> -{
> -    afr_private_t *priv = NULL;
> -    int ret = -1;
> -
> -    priv = this->private;
> -
> -    if (priv->child_count <= 16)
> -        ret = __afr_inode_event_gen_reset_small(inode, this);
> -    else
> -        ret = -1;
> -
> -    return ret;
> -}
> -
>  int
>  afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
>                            unsigned char *metadata, int *event_p)
> @@ -721,30 +677,22 @@ out:
>      return need_refresh;
>  }
>  
> -static int
> -afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
> +int
> +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
>  {
>      int ret = -1;
>      afr_inode_ctx_t *ctx = NULL;
>  
> -    GF_VALIDATE_OR_GOTO(this->name, inode, out);
> -
> -    LOCK(&inode->lock);
> -    {
> -        ret = __afr_inode_ctx_get(this, inode, &ctx);
> -        if (ret)
> -            goto unlock;
> -
> +    ret = __afr_inode_ctx_get(this, inode, &ctx);
> +    if (ret == 0) {
>          ctx->need_refresh = _gf_true;
>      }
> -unlock:
> -    UNLOCK(&inode->lock);
> -out:
> +
>      return ret;
>  }
>  
>  int
> -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
> +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
>  {
>      int ret = -1;
>  
> @@ -754,7 +702,7 @@ afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
>                       "Resetting event gen for %s", uuid_utoa(inode->gfid));
>      LOCK(&inode->lock);
>      {
> -        ret = __afr_inode_event_gen_reset(inode, this);
> +        ret = __afr_inode_need_refresh_set(inode, this);
>      }
>      UNLOCK(&inode->lock);
>  out:
> @@ -1187,7 +1135,7 @@ afr_txn_refresh_done(call_frame_t *frame, xlator_t *this, int err)
>      ret = afr_inode_get_readable(frame, inode, this, local->readable,
>                                   &event_generation, local->transaction.type);
>  
> -    if (ret == -EIO || (local->is_read_txn && !event_generation)) {
> +    if (ret == -EIO) {
>          /* No readable subvolume even after refresh ==> splitbrain.*/
>          if (!priv->fav_child_policy) {
>              err = EIO;
> @@ -2451,7 +2399,7 @@ afr_lookup_done(call_frame_t *frame, xlator_t *this)
>          if (read_subvol == -1)
>              goto cant_interpret;
>          if (ret) {
> -            afr_inode_event_gen_reset(local->inode, this);
> +            afr_inode_need_refresh_set(local->inode, this);
>              dict_del_sizen(local->replies[read_subvol].xdata, GF_CONTENT_KEY);
>          }
>      } else {
> @@ -3007,6 +2955,7 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this)
>      afr_private_t *priv = NULL;
>      afr_local_t *local = NULL;
>      int read_subvol = -1;
> +    int ret = 0;
>      unsigned char *data_readable = NULL;
>      unsigned char *success_replies = NULL;
>  
> @@ -3028,7 +2977,10 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this)
>      if (!afr_has_quorum(success_replies, this, frame))
>          goto unwind;
>  
> -    afr_replies_interpret(frame, this, local->inode, NULL);
> +    ret = afr_replies_interpret(frame, this, local->inode, NULL);
> +    if (ret) {
> +        afr_inode_need_refresh_set(local->inode, this);
> +    }
>  
>      read_subvol = afr_read_subvol_decide(local->inode, this, NULL,
>                                           data_readable);
> @@ -3284,11 +3236,7 @@ afr_discover(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req)
>      afr_read_subvol_get(loc->inode, this, NULL, NULL, &event,
>                          AFR_DATA_TRANSACTION, NULL);
>  
> -    if (afr_is_inode_refresh_reqd(loc->inode, this, event,
> -                                  local->event_generation))
> -        afr_inode_refresh(frame, this, loc->inode, NULL, afr_discover_do);
> -    else
> -        afr_discover_do(frame, this, 0);
> +    afr_discover_do(frame, this, 0);
>  
>      return 0;
>  out:
> @@ -3429,11 +3377,7 @@ afr_lookup(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req)
>      afr_read_subvol_get(loc->parent, this, NULL, NULL, &event,
>                          AFR_DATA_TRANSACTION, NULL);
>  
> -    if (afr_is_inode_refresh_reqd(loc->inode, this, event,
> -                                  local->event_generation))
> -        afr_inode_refresh(frame, this, loc->parent, NULL, afr_lookup_do);
> -    else
> -        afr_lookup_do(frame, this, 0);
> +    afr_lookup_do(frame, this, 0);
>  
>      return 0;
>  out:
> diff --git a/xlators/cluster/afr/src/afr-dir-write.c b/xlators/cluster/afr/src/afr-dir-write.c
> index 82a72fddd..333085b14 100644
> --- a/xlators/cluster/afr/src/afr-dir-write.c
> +++ b/xlators/cluster/afr/src/afr-dir-write.c
> @@ -119,11 +119,11 @@ __afr_dir_write_finalize(call_frame_t *frame, xlator_t *this)
>              continue;
>          if (local->replies[i].op_ret < 0) {
>              if (local->inode)
> -                afr_inode_event_gen_reset(local->inode, this);
> +                afr_inode_need_refresh_set(local->inode, this);
>              if (local->parent)
> -                afr_inode_event_gen_reset(local->parent, this);
> +                afr_inode_need_refresh_set(local->parent, this);
>              if (local->parent2)
> -                afr_inode_event_gen_reset(local->parent2, this);
> +                afr_inode_need_refresh_set(local->parent2, this);
>              continue;
>          }
>  
> diff --git a/xlators/cluster/afr/src/afr.h b/xlators/cluster/afr/src/afr.h
> index a3f2942b3..ed6d777c1 100644
> --- a/xlators/cluster/afr/src/afr.h
> +++ b/xlators/cluster/afr/src/afr.h
> @@ -958,7 +958,10 @@ afr_inode_read_subvol_set(inode_t *inode, xlator_t *this,
>                            int event_generation);
>  
>  int
> -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this);
> +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this);
> +
> +int
> +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this);
>  
>  int
>  afr_read_subvol_select_by_policy(inode_t *inode, xlator_t *this,
> -- 
> 2.25.2
> 



Erik Jacobson
Software Engineer

erik.jacobson at hpe.com
+1 612 851 0550 Office

Eagan, MN
hpe.com


More information about the Gluster-users mailing list