[Bugs] [Bug 1408712] New: with granular-entry-self-heal enabled i see that there is a gfid mismatch and vm goes to paused state after migrating to another host

bugzilla at redhat.com bugzilla at redhat.com
Mon Dec 26 14:57:40 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1408712

            Bug ID: 1408712
           Summary: with granular-entry-self-heal enabled i see that there
                    is a gfid mismatch and vm goes to paused state after
                    migrating to another host
           Product: GlusterFS
           Version: mainline
         Component: replicate
          Keywords: Triaged
          Severity: high
          Assignee: kdhananj at redhat.com
          Reporter: kdhananj at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    knarra at redhat.com, ksandha at redhat.com,
                    nchilaka at redhat.com, rcyriac at redhat.com,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    sasundar at redhat.com, storage-qa-internal at redhat.com
        Depends On: 1408426
            Blocks: 1277939 (Gluster-HC-2), 1351528, 1400057



+++ This bug was initially created as a clone of Bug #1408426 +++

Description of problem:
A VM is created while one of the data bricks is down. Once the brick is
brought back up, I see that some entries do not get healed, and when the VM is
migrated to another node it goes into a paused state, with the following
errors logged in the mount logs.

[2016-12-23 09:14:16.481519] W [MSGID: 108008]
[afr-self-heal-name.c:369:afr_selfheal_name_gfid_mismatch_check]
0-engine-replicate-0: GFID mismatch for
<gfid:be318638-e8a0-4c6d-977d-7a937aa84806>/f735902d-12fa-4e4d-88c9-1b8ba06e3063.1673
6e17b733-b8a4-4563-bc3d-f659c9a46c2a on engine-client-1 and
55648f43-7e09-4e62-b7d2-16fe1ff7b23e on engine-client-0
[2016-12-23 09:14:16.482442] E [MSGID: 133010]
[shard.c:1582:shard_common_lookup_shards_cbk] 0-engine-shard: Lookup on shard
1673 failed. Base file gfid = f735902d-12fa-4e4d-88c9-1b8ba06e3063
[Input/output error]
[2016-12-23 09:14:16.482474] W [fuse-bridge.c:2228:fuse_readv_cbk]
0-glusterfs-fuse: 11280842: READ => -1
gfid=f735902d-12fa-4e4d-88c9-1b8ba06e3063 fd=0x7faeda380210 (Input/output
error)
[2016-12-23 10:08:41.956330] W [MSGID: 108008]
[afr-self-heal-name.c:369:afr_selfheal_name_gfid_mismatch_check]
0-engine-replicate-0: GFID mismatch for
<gfid:be318638-e8a0-4c6d-977d-7a937aa84806>/f735902d-12fa-4e4d-88c9-1b8ba06e3063.1673
6e17b733-b8a4-4563-bc3d-f659c9a46c2a on engine-client-1 and
55648f43-7e09-4e62-b7d2-16fe1ff7b23e on engine-client-0
[2016-12-23 10:08:41.957422] E [MSGID: 133010]
[shard.c:1582:shard_common_lookup_shards_cbk] 0-engine-shard: Lookup on shard
1673 failed. Base file gfid = f735902d-12fa-4e4d-88c9-1b8ba06e3063
[Input/output error]
[2016-12-23 10:08:41.957444] W [fuse-bridge.c:2228:fuse_readv_cbk]
0-glusterfs-fuse: 11427307: READ => -1
gfid=f735902d-12fa-4e4d-88c9-1b8ba06e3063 fd=0x7faeda380328 (Input/output
error)
[2016-12-23 10:45:10.609600] W [MSGID: 108008]
[afr-self-heal-name.c:369:afr_selfheal_name_gfid_mismatch_check]
0-engine-replicate-0: GFID mismatch for
<gfid:be318638-e8a0-4c6d-977d-7a937aa84806>/f735902d-12fa-4e4d-88c9-1b8ba06e3063.1673
6e17b733-b8a4-4563-bc3d-f659c9a46c2a on engine-client-1 and
55648f43-7e09-4e62-b7d2-16fe1ff7b23e on engine-client-0
[2016-12-23 10:45:10.610550] E [MSGID: 133010]
[shard.c:1582:shard_common_lookup_shards_cbk] 0-engine-shard: Lookup on shard
1673 failed. Base file gfid = f735902d-12fa-4e4d-88c9-1b8ba06e3063
[Input/output error]
[2016-12-23 10:45:10.610574] W [fuse-bridge.c:2228:fuse_readv_cbk]
0-glusterfs-fuse: 11526955: READ => -1
gfid=f735902d-12fa-4e4d-88c9-1b8ba06e3063 fd=0x7faeda380184 (Input/output
error)


Version-Release number of selected component (if applicable):
glusterfs-3.8.4-9.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install HC with three nodes.
2. Create an arbiter volume and enable all the options using gdeploy.
3. Now bring down the first brick in the arbiter volume and create a VM.
4. Once the VM creation is complete, bring the brick back up and wait for
self-heal to complete (see the sketch after this list).
5. Now migrate the VM to another host.
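
For step 4, here is a minimal polling sketch (not part of the original report)
that waits until 'gluster volume heal <vol> info' reports no pending entries.
It assumes the volume is named "engine" (as the mount logs suggest), that the
gluster CLI is on PATH, and that it runs on one of the storage nodes:

import re
import subprocess
import time

VOLUME = "engine"

def pending_entries(volume):
    # 'gluster volume heal <vol> info' prints a "Number of entries: N" line
    # per brick; add them up.
    out = subprocess.check_output(
        ["gluster", "volume", "heal", volume, "info"]).decode()
    return sum(int(n) for n in re.findall(r"Number of entries:\s*(\d+)", out))

while pending_entries(VOLUME) > 0:
    print("heal still in progress on volume", VOLUME)
    time.sleep(10)
print("no entries pending heal on volume", VOLUME)

In this bug the loop never finishes, because the same shards stay listed in
heal info indefinitely (issue 1 under "Actual results" below).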

Actual results:
I have seen two issues:
1) Some entries on the node remain unhealed even after a long time.
2) Once the VM is migrated, it goes into a paused state.

Expected results:
The VM should not go into a paused state after migration, and there should be
no entries remaining in volume heal info.

Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-12-23
05:56:11 EST ---

This bug is automatically being proposed for the current release of Red Hat
Gluster Storage 3 under active development, by setting the release flag
'rhgs-3.2.0' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from RamaKasturi on 2016-12-23 05:59:44 EST ---

As suggested by Pranith, I disabled granular entry self-heal on the volume and
I no longer see the issue.
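
For reference, a minimal sketch of applying the workaround from one of the
nodes. It assumes the volume is named "engine" and that this build accepts the
'granular-entry-heal disable' sub-command of 'gluster volume heal' (the
'cluster.granular-entry-heal' volume option is the equivalent knob under
'gluster volume set'):

import subprocess

VOLUME = "engine"

subprocess.check_call(
    ["gluster", "volume", "heal", VOLUME, "granular-entry-heal", "disable"])

# Equivalent volume-set form:
# subprocess.check_call(
#     ["gluster", "volume", "set", VOLUME, "cluster.granular-entry-heal", "off"])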


--- Additional comment from Krutika Dhananjay on 2016-12-26 05:41:04 EST ---

Resuming from https://bugzilla.redhat.com/show_bug.cgi?id=1400057#c11 to
explain why there would be a gfid mismatch. So please go through
https://bugzilla.redhat.com/show_bug.cgi?id=1400057#c11 first.

... the pending xattrs on .shard have at this point been erased. Now, when the
brick that was down comes back online, the next MKNOD on this shard's name
(triggered by a shard readv fop, whenever that happens) gets EEXIST from the
bricks that were already online, while on the brick that was previously
offline the creation of the shard succeeds, although with a new gfid. This
leads to the gfid mismatch.
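
To make the sequence above concrete, here is a toy model. It is not GlusterFS
code; the classes, function names and the per-MKNOD gfid generation are
illustrative assumptions, meant only to show how the returning brick ends up
holding a different gfid for the same shard name:

import uuid

class Brick:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.shards = {}                    # shard name -> gfid

    def mknod(self, shard_name, gfid):
        # A brick's view of MKNOD: EEXIST if the name already exists (the
        # brick keeps the gfid it already has), otherwise create the entry
        # with the gfid requested by the client.
        if shard_name in self.shards:
            return "EEXIST"
        self.shards[shard_name] = gfid
        return "OK"

def client_mknod(bricks, shard_name):
    # The client side generates a fresh gfid for each MKNOD attempt and sends
    # it to every brick that is currently reachable.
    gfid = str(uuid.uuid4())
    for b in bricks:
        if b.online:
            print(b.name, b.mknod(shard_name, gfid))

bricks = [Brick("engine-client-0"), Brick("engine-client-1"), Brick("arbiter")]
shard = "f735902d-12fa-4e4d-88c9-1b8ba06e3063.1673"  # shard name from the logs

# 1. engine-client-0 is down while the VM first writes this shard.
bricks[0].online = False
client_mknod(bricks, shard)

# 2. The granular-entry-heal bug erases the pending xattrs on .shard, so
#    nothing records that engine-client-0 still needs this entry (modeled
#    here by doing nothing).

# 3. The brick comes back; a later shard readv triggers MKNOD again, this
#    time with a new gfid.
bricks[0].online = True
client_mknod(bricks, shard)

for b in bricks:
    print(b.name, "->", b.shards[shard])

Running this prints the same gfid for engine-client-1 and the arbiter but a
different one for engine-client-0, which is the disagreement that
afr_selfheal_name_gfid_mismatch_check reports in the mount log above.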


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1277939
[Bug 1277939] (Gluster-HC-2) [TRACKER] Gluster Hyperconvergence - MVP
https://bugzilla.redhat.com/show_bug.cgi?id=1400057
[Bug 1400057] self-heal not happening, as self-heal info lists the same
pending shards to be healed
https://bugzilla.redhat.com/show_bug.cgi?id=1408426
[Bug 1408426] with granular-entry-self-heal enabled i see that there is a
gfid mismatch and vm goes to paused state after migrating to another host