[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Erik Jacobson
erik.jacobson at hpe.com
Wed Apr 15 15:39:34 UTC 2020
After several successful runs of the test case, we thought the problem
was solved. Indeed, the split-brain is gone.
But we're now triggering a segfault, even in a less loaded case.
We're going to switch to gluster 7.4, which is the version you intended
the patch for, and report back.
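
To make the race concrete for anyone following along, here is a minimal
standalone sketch of the failure mode Ravi describes in the quoted thread
below (a lookup zeroing the event generation underneath an in-flight read
transaction). This is not GlusterFS source: the struct and function names
are invented for illustration only.

/* Toy model of the race. "Old" mimics afr_lookup_done() zeroing the
 * event gen, so a concurrent read txn that checks for zero concludes
 * split-brain and fails with EIO. "New" mimics the patch: set a
 * need_refresh flag instead, leaving the causal event gen intact. */
#include <stdbool.h>
#include <stdio.h>

struct toy_inode_ctx {
    unsigned event_gen;  /* stands in for the event gen in the afr inode ctx */
    bool need_refresh;   /* the marker the patch sets instead of zeroing */
};

/* Old behaviour: a racing lookup resets event gen to zero. */
static void lookup_done_old(struct toy_inode_ctx *ctx)
{
    ctx->event_gen = 0;
}

/* New behaviour: only mark the inode for refresh on the next access. */
static void lookup_done_new(struct toy_inode_ctx *ctx)
{
    ctx->need_refresh = true;
}

/* Old read-txn check: a zero event gen is treated as split-brain. */
static int read_txn_check_old(const struct toy_inode_ctx *ctx)
{
    return ctx->event_gen == 0 ? -1 /* EIO */ : 0;
}

/* New read-txn check: the in-flight read keeps the copies it already
 * chose; a pending refresh only affects the *next* operation. */
static int read_txn_check_new(const struct toy_inode_ctx *ctx)
{
    (void)ctx;
    return 0;
}

int main(void)
{
    struct toy_inode_ctx ctx = {.event_gen = 42, .need_refresh = false};

    lookup_done_old(&ctx);  /* racing lookup lands mid-read-txn */
    printf("old logic: read txn returns %d (spurious split-brain)\n",
           read_txn_check_old(&ctx));

    ctx.event_gen = 42;
    lookup_done_new(&ctx);  /* same interleaving with the patched logic */
    printf("new logic: read txn returns %d, need_refresh=%d\n",
           read_txn_check_new(&ctx), ctx.need_refresh);
    return 0;
}

The point of the patch, as I read it, is exactly this: never destroy the
event generation a read transaction may already be relying on; only flag
the inode so the next lookup triggers the refresh and any client-side heal.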
On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote:
> > Attached the wrong patch by mistake in my previous mail. Sending the correct
> > one now.
>
> Early results look GREAT!!
>
> We'll keep beating on it. We applied it to gluster 7.2, as that is what
> we have to ship with. It applied fine with some line offsets.
>
> If you would like us to also run a test with gluster 7.4 so that you can
> say it's tested, we can run that test. I can do a special build.
>
> THANK YOU!!
>
> >
> >
> > -Ravi
> >
> >
> > On 15/04/20 2:05 pm, Ravishankar N wrote:
> >
> >
> > On 10/04/20 2:06 am, Erik Jacobson wrote:
> >
> > Once again thanks for sticking with us. Here is a reply from Scott
> > Titus. If you have something for us to try, we'd love it. The code had
> > your patch applied when gdb was run:
> >
> >
> > Here is the addr2line output for those addresses. Very interesting
> > command, of
> > which I was not aware.
> >
> > [root at leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f735
> > afr_lookup_metadata_heal_check
> > afr-common.c:2803
> > [root at leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f0b9
> > afr_lookup_done
> > afr-common.c:2455
> > [root at leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x5c701
> > afr_inode_event_gen_reset
> > afr-common.c:755
> >
> >
> > Right, so afr_lookup_done() is resetting the event gen to zero. This looks
> > like a race between lookup and inode refresh code paths. We made some
> > changes to the event generation logic in AFR. Can you apply the attached
> > patch and see if it fixes the split-brain issue? It should apply cleanly on
> > glusterfs-7.4.
> >
> > Thanks,
> > Ravi
> >
> >
>
> > From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001
> > From: Ravishankar N <ravishankar at redhat.com>
> > Date: Wed, 15 Apr 2020 13:53:26 +0530
> > Subject: [PATCH] afr: event gen changes
> >
> > The general idea of the changes is to prevent resetting event generation
> > to zero in the inode ctx, since event gen is something that should
> > follow 'causal order'.
> >
> > Change #1:
> > For a read txn, in the inode refresh cbk, if event_generation is
> > found to be zero, we fail the read fop. This is not needed
> > because a change in event gen is only a marker for the next inode refresh to
> > happen and should not be taken into account by the current read txn.
> >
> > Change #2:
> > The event gen being zero above can happen if there is a racing lookup,
> > which resets event gen (in afr_lookup_done) if there are non-zero afr
> > xattrs. The resetting is done only to trigger an inode refresh and a
> > possible client side heal on the next lookup. That can be achieved by
> > setting the need_refresh flag in the inode ctx. So replaced all
> > occurrences of resetting event gen to zero with a call to
> > afr_inode_need_refresh_set().
> >
> > Change #3:
> > In both the lookup and discover paths, we are doing an inode refresh which is
> > not required, since all three essentially do the same thing: update the inode
> > ctx with the good/bad copies from the brick replies. Inode refresh also
> > triggers background heals, but I think it is okay to do it when we call
> > refresh during the read and write txns and not in the lookup path.
> >
> > Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1
> > Signed-off-by: Ravishankar N <ravishankar at redhat.com>
> > ---
> > ...ismatch-resolution-with-fav-child-policy.t | 8 +-
> > xlators/cluster/afr/src/afr-common.c | 92 ++++---------------
> > xlators/cluster/afr/src/afr-dir-write.c | 6 +-
> > xlators/cluster/afr/src/afr.h | 5 +-
> > 4 files changed, 29 insertions(+), 82 deletions(-)
> >
> > diff --git a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> > index f4aa351e4..12af0c854 100644
> > --- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> > +++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> > @@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ]
> > #We know that second brick has the bigger size file
> > BIGGER_FILE_MD5=$(md5sum $B0/${V0}1/f3 | cut -d\ -f1)
> >
> > -TEST ls $M0/f3
> > -TEST cat $M0/f3
> > +TEST ls $M0 #Trigger entry heal via readdir inode refresh
> > +TEST cat $M0/f3 #Trigger data heal via readv inode refresh
> > EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
> >
> > #gfid split-brain should be resolved
> > @@ -215,8 +215,8 @@ TEST $CLI volume start $V0 force
> > EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}2
> > EXPECT_WITHIN $CHILD_UP_TIMEOUT "1" afr_child_up_status $V0 2
> >
> > -TEST ls $M0/f4
> > -TEST cat $M0/f4
> > +TEST ls $M0 #Trigger entry heal via readdir inode refresh
> > +TEST cat $M0/f4 #Trigger data heal via readv inode refresh
> > EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
> >
> > #gfid split-brain should be resolved
> > diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
> > index 61f21795e..319665a14 100644
> > --- a/xlators/cluster/afr/src/afr-common.c
> > +++ b/xlators/cluster/afr/src/afr-common.c
> > @@ -282,7 +282,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
> > metadatamap |= (1 << index);
> > }
> > if (metadatamap_old != metadatamap) {
> > - event = 0;
> > + __afr_inode_need_refresh_set(inode, this);
> > }
> > break;
> >
> > @@ -295,7 +295,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
> > datamap |= (1 << index);
> > }
> > if (datamap_old != datamap)
> > - event = 0;
> > + __afr_inode_need_refresh_set(inode, this);
> > break;
> >
> > default:
> > @@ -458,34 +458,6 @@ out:
> > return ret;
> > }
> >
> > -int
> > -__afr_inode_event_gen_reset_small(inode_t *inode, xlator_t *this)
> > -{
> > - int ret = -1;
> > - uint16_t datamap = 0;
> > - uint16_t metadatamap = 0;
> > - uint32_t event = 0;
> > - uint64_t val = 0;
> > - afr_inode_ctx_t *ctx = NULL;
> > -
> > - ret = __afr_inode_ctx_get(this, inode, &ctx);
> > - if (ret)
> > - return ret;
> > -
> > - val = ctx->read_subvol;
> > -
> > - metadatamap = (val & 0x000000000000ffff) >> 0;
> > - datamap = (val & 0x00000000ffff0000) >> 16;
> > - event = 0;
> > -
> > - val = ((uint64_t)metadatamap) | (((uint64_t)datamap) << 16) |
> > - (((uint64_t)event) << 32);
> > -
> > - ctx->read_subvol = val;
> > -
> > - return ret;
> > -}
> > -
> > int
> > __afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
> > unsigned char *metadata, int *event_p)
> > @@ -556,22 +528,6 @@ out:
> > return ret;
> > }
> >
> > -int
> > -__afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
> > -{
> > - afr_private_t *priv = NULL;
> > - int ret = -1;
> > -
> > - priv = this->private;
> > -
> > - if (priv->child_count <= 16)
> > - ret = __afr_inode_event_gen_reset_small(inode, this);
> > - else
> > - ret = -1;
> > -
> > - return ret;
> > -}
> > -
> > int
> > afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
> > unsigned char *metadata, int *event_p)
> > @@ -721,30 +677,22 @@ out:
> > return need_refresh;
> > }
> >
> > -static int
> > -afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
> > +int
> > +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
> > {
> > int ret = -1;
> > afr_inode_ctx_t *ctx = NULL;
> >
> > - GF_VALIDATE_OR_GOTO(this->name, inode, out);
> > -
> > - LOCK(&inode->lock);
> > - {
> > - ret = __afr_inode_ctx_get(this, inode, &ctx);
> > - if (ret)
> > - goto unlock;
> > -
> > + ret = __afr_inode_ctx_get(this, inode, &ctx);
> > + if (ret == 0) {
> > ctx->need_refresh = _gf_true;
> > }
> > -unlock:
> > - UNLOCK(&inode->lock);
> > -out:
> > +
> > return ret;
> > }
> >
> > int
> > -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
> > +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
> > {
> > int ret = -1;
> >
> > @@ -754,7 +702,7 @@ afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
> > "Resetting event gen for %s", uuid_utoa(inode->gfid));
> > LOCK(&inode->lock);
> > {
> > - ret = __afr_inode_event_gen_reset(inode, this);
> > + ret = __afr_inode_need_refresh_set(inode, this);
> > }
> > UNLOCK(&inode->lock);
> > out:
> > @@ -1187,7 +1135,7 @@ afr_txn_refresh_done(call_frame_t *frame, xlator_t *this, int err)
> > ret = afr_inode_get_readable(frame, inode, this, local->readable,
> > &event_generation, local->transaction.type);
> >
> > - if (ret == -EIO || (local->is_read_txn && !event_generation)) {
> > + if (ret == -EIO) {
> > /* No readable subvolume even after refresh ==> splitbrain.*/
> > if (!priv->fav_child_policy) {
> > err = EIO;
> > @@ -2451,7 +2399,7 @@ afr_lookup_done(call_frame_t *frame, xlator_t *this)
> > if (read_subvol == -1)
> > goto cant_interpret;
> > if (ret) {
> > - afr_inode_event_gen_reset(local->inode, this);
> > + afr_inode_need_refresh_set(local->inode, this);
> > dict_del_sizen(local->replies[read_subvol].xdata, GF_CONTENT_KEY);
> > }
> > } else {
> > @@ -3007,6 +2955,7 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this)
> > afr_private_t *priv = NULL;
> > afr_local_t *local = NULL;
> > int read_subvol = -1;
> > + int ret = 0;
> > unsigned char *data_readable = NULL;
> > unsigned char *success_replies = NULL;
> >
> > @@ -3028,7 +2977,10 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this)
> > if (!afr_has_quorum(success_replies, this, frame))
> > goto unwind;
> >
> > - afr_replies_interpret(frame, this, local->inode, NULL);
> > + ret = afr_replies_interpret(frame, this, local->inode, NULL);
> > + if (ret) {
> > + afr_inode_need_refresh_set(local->inode, this);
> > + }
> >
> > read_subvol = afr_read_subvol_decide(local->inode, this, NULL,
> > data_readable);
> > @@ -3284,11 +3236,7 @@ afr_discover(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req)
> > afr_read_subvol_get(loc->inode, this, NULL, NULL, &event,
> > AFR_DATA_TRANSACTION, NULL);
> >
> > - if (afr_is_inode_refresh_reqd(loc->inode, this, event,
> > - local->event_generation))
> > - afr_inode_refresh(frame, this, loc->inode, NULL, afr_discover_do);
> > - else
> > - afr_discover_do(frame, this, 0);
> > + afr_discover_do(frame, this, 0);
> >
> > return 0;
> > out:
> > @@ -3429,11 +3377,7 @@ afr_lookup(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req)
> > afr_read_subvol_get(loc->parent, this, NULL, NULL, &event,
> > AFR_DATA_TRANSACTION, NULL);
> >
> > - if (afr_is_inode_refresh_reqd(loc->inode, this, event,
> > - local->event_generation))
> > - afr_inode_refresh(frame, this, loc->parent, NULL, afr_lookup_do);
> > - else
> > - afr_lookup_do(frame, this, 0);
> > + afr_lookup_do(frame, this, 0);
> >
> > return 0;
> > out:
> > diff --git a/xlators/cluster/afr/src/afr-dir-write.c b/xlators/cluster/afr/src/afr-dir-write.c
> > index 82a72fddd..333085b14 100644
> > --- a/xlators/cluster/afr/src/afr-dir-write.c
> > +++ b/xlators/cluster/afr/src/afr-dir-write.c
> > @@ -119,11 +119,11 @@ __afr_dir_write_finalize(call_frame_t *frame, xlator_t *this)
> > continue;
> > if (local->replies[i].op_ret < 0) {
> > if (local->inode)
> > - afr_inode_event_gen_reset(local->inode, this);
> > + afr_inode_need_refresh_set(local->inode, this);
> > if (local->parent)
> > - afr_inode_event_gen_reset(local->parent, this);
> > + afr_inode_need_refresh_set(local->parent, this);
> > if (local->parent2)
> > - afr_inode_event_gen_reset(local->parent2, this);
> > + afr_inode_need_refresh_set(local->parent2, this);
> > continue;
> > }
> >
> > diff --git a/xlators/cluster/afr/src/afr.h b/xlators/cluster/afr/src/afr.h
> > index a3f2942b3..ed6d777c1 100644
> > --- a/xlators/cluster/afr/src/afr.h
> > +++ b/xlators/cluster/afr/src/afr.h
> > @@ -958,7 +958,10 @@ afr_inode_read_subvol_set(inode_t *inode, xlator_t *this,
> > int event_generation);
> >
> > int
> > -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this);
> > +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this);
> > +
> > +int
> > +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this);
> >
> > int
> > afr_read_subvol_select_by_policy(inode_t *inode, xlator_t *this,
> > --
> > 2.25.2
> >
>
>
>
Erik Jacobson
Software Engineer
erik.jacobson at hpe.com
+1 612 851 0550 Office
Eagan, MN
hpe.com