[Bugs] [Bug 1403743] New: self-heal not happening, as self-heal info lists the same pending shards to be healed

bugzilla at redhat.com bugzilla at redhat.com
Mon Dec 12 10:16:29 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1403743

            Bug ID: 1403743
           Summary: self-heal not happening, as self-heal info lists the
                    same pending shards to be healed
           Product: GlusterFS
           Version: 3.9
         Component: replicate
          Keywords: Triaged
          Severity: high
          Assignee: kdhananj at redhat.com
          Reporter: kdhananj at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    kdhananj at redhat.com, rcyriac at redhat.com,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    sasundar at redhat.com, storage-qa-internal at redhat.com
        Depends On: 1402730
            Blocks: 1400057, 1403646



+++ This bug was initially created as a clone of Bug #1402730 +++

+++ This bug was initially created as a clone of Bug #1400057 +++

Description of problem:
-----------------------
After killing one of the bricks of a replica 3 volume in a cyclic fashion (kill
brick1, bring brick1 up, let self-heal progress, repeat with the other 2
bricks), I see that healing is not happening for a few shards.

self-heal info shows the same number of entries for a long time.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.2.0 interim build ( glusterfs-3.8.4-6.el7rhgs )
RHEL 7.3
qemu-kvm-rhev-2.6.0-27.el7.x86_64

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Create a replica 3 volume and optimize the volume for VM store. Enable
granular-entry-heal and also compound-fops
2. Fuse mount the volume and create 2 VM images of size 50G (sparse, thin
allocated)
3. Create 2 app VMs with these images and install RHEL 7.3 on them
4. Keep doing I/O from these VMs using 'dd'
5. Kill brick1 and wait for the number of entries to heal to grow to around 1000
6. Start brick1 by force-starting the volume
7. Let the self-heal happen, but before it can complete, kill brick2
8. After a few more minutes, start brick2 by force-starting the volume
9. Let the self-heal happen, but before it can complete, kill brick3
10. After a few more minutes, start brick3 by force-starting the volume
11. Do one more iteration of steps 5-10 (a shell sketch of this kill/force-start
cycle follows the list)
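
A minimal shell sketch of the kill/force-start cycle in steps 5-10, assuming the
volume name vmvol and the brick backend path /gluster/brick2/bk from the volume
info further below; the pgrep pattern and the sleep duration are assumptions,
not the exact commands that were used:

#!/bin/bash
# Run on the server hosting the brick that is to be taken down.
VOL=vmvol
BRICK_PATH=/gluster/brick2/bk        # backend path of this server's brick

# Kill only the glusterfsd process serving this brick.
kill -KILL $(pgrep -f "glusterfsd.*${BRICK_PATH}")

# Let pending heal entries accumulate while the VMs keep doing I/O.
sleep 600

# Bring the brick back by force-starting the volume (can be run on any server).
gluster volume start ${VOL} force

# Watch self-heal progress; repeat the cycle with the next brick before it completes.
gluster volume heal ${VOL} info | grep 'Number of entries'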

Actual results:
---------------
self-heal doesn't happen for a few shards

Expected results:
-----------------
self-heal should heal all the pending entries

--- Additional comment from SATHEESARAN on 2016-11-30 06:38:08 EST ---

Volume information
-------------------

# gluster volume info vmvol

Volume Name: vmvol
Type: Replicate
Volume ID: 0b3267c4-75ac-4660-a602-54a3eb331de5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/gluster/brick2/bk
Brick2: server2:/gluster/brick2/bk
Brick3: server3:/gluster/brick2/bk
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
storage.owner-uid: 36
storage.owner-gid: 36
features.shard-block-size: 4MB
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
features.shard: on
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
performance.strict-o-direct: on
network.ping-timeout: 30
user.cifs: off
cluster.use-compound-fops: on
cluster.granular-entry-heal: enable


Pending heals :
---------------
I have a script that looks for unsynced entries as the heal happens. I am
pasting the output of this script, captured while the test was going on (a
sketch of such a script follows the log excerpt).

Wed Nov 30 13:16:50 IST 2016    Unsynced entries on all bricks: 1940 - 2016
Wed Nov 30 13:18:01 IST 2016    Unsynced entries on all bricks: 2211 - 2267
Wed Nov 30 13:19:54 IST 2016    Unsynced entries on all bricks: - 55 1988
Wed Nov 30 13:20:53 IST 2016    Unsynced entries on all bricks: - 179 1868
Wed Nov 30 13:21:49 IST 2016    Unsynced entries on all bricks: - 298 1753
Wed Nov 30 13:22:54 IST 2016    Unsynced entries on all bricks: 2014 493 -
Wed Nov 30 13:23:50 IST 2016    Unsynced entries on all bricks: 1921 675 -
Wed Nov 30 13:25:07 IST 2016    Unsynced entries on all bricks: 1774 828 1694
Wed Nov 30 13:27:11 IST 2016    Unsynced entries on all bricks: 1740 - 1613
Wed Nov 30 13:28:56 IST 2016    Unsynced entries on all bricks: 1643 - 1855
Wed Nov 30 13:29:47 IST 2016    Unsynced entries on all bricks: 1896 - 2017
Wed Nov 30 13:30:41 IST 2016    Unsynced entries on all bricks: 2057 683 1856
Wed Nov 30 13:33:19 IST 2016    Unsynced entries on all bricks: 637 460 1657
Wed Nov 30 13:34:23 IST 2016    Unsynced entries on all bricks: - 393 1565
Wed Nov 30 13:35:22 IST 2016    Unsynced entries on all bricks: - 312 1454
Wed Nov 30 13:36:09 IST 2016    Unsynced entries on all bricks: - 323 1301
Wed Nov 30 13:36:52 IST 2016    Unsynced entries on all bricks: - 369 1159
Wed Nov 30 13:37:31 IST 2016    Unsynced entries on all bricks: - 410 1045
Wed Nov 30 13:38:03 IST 2016    Unsynced entries on all bricks: - 441 976
Wed Nov 30 13:38:36 IST 2016    Unsynced entries on all bricks: - 480 894
Wed Nov 30 13:39:09 IST 2016    Unsynced entries on all bricks: - 512 774
Wed Nov 30 13:39:43 IST 2016    Unsynced entries on all bricks: - 548 658
Wed Nov 30 13:40:14 IST 2016    Unsynced entries on all bricks: - 579 548
Wed Nov 30 13:40:44 IST 2016    Unsynced entries on all bricks: - 609 542
Wed Nov 30 13:41:05 IST 2016    Unsynced entries on all bricks: - 687 616
Wed Nov 30 13:41:29 IST 2016    Unsynced entries on all bricks: - 771 694
Wed Nov 30 13:41:56 IST 2016    Unsynced entries on all bricks: - 859 780
Wed Nov 30 13:42:26 IST 2016    Unsynced entries on all bricks: - 950 902
Wed Nov 30 13:43:00 IST 2016    Unsynced entries on all bricks: - 1051 1007
Wed Nov 30 13:43:35 IST 2016    Unsynced entries on all bricks: - 1149 1097
Wed Nov 30 13:44:10 IST 2016    Unsynced entries on all bricks: - 1240 1199
Wed Nov 30 13:44:48 IST 2016    Unsynced entries on all bricks: - 1378 1317
Wed Nov 30 13:45:29 IST 2016    Unsynced entries on all bricks: - 1473 1434
Wed Nov 30 13:46:11 IST 2016    Unsynced entries on all bricks: - 1587 1533
Wed Nov 30 13:46:54 IST 2016    Unsynced entries on all bricks: - 1689 1652
Wed Nov 30 13:47:43 IST 2016    Unsynced entries on all bricks: - 1810 1785
Wed Nov 30 13:48:36 IST 2016    Unsynced entries on all bricks: - 1968 1902
Wed Nov 30 13:49:28 IST 2016    Unsynced entries on all bricks: - 2093 2069
Wed Nov 30 13:50:27 IST 2016    Unsynced entries on all bricks: - 2229 2214
Wed Nov 30 13:51:29 IST 2016    Unsynced entries on all bricks: - 2384 2373
Wed Nov 30 13:52:32 IST 2016    Unsynced entries on all bricks: - 2528 2498
Wed Nov 30 13:53:38 IST 2016    Unsynced entries on all bricks: - 2593 2434
Wed Nov 30 13:54:54 IST 2016    Unsynced entries on all bricks: 1733 2184 1854
Wed Nov 30 13:56:42 IST 2016    Unsynced entries on all bricks: 1733 1577 1322
Wed Nov 30 13:58:12 IST 2016    Unsynced entries on all bricks: 1734 978 751
Wed Nov 30 14:00:17 IST 2016    Unsynced entries on all bricks: 1498 582 423
Wed Nov 30 14:01:49 IST 2016    Unsynced entries on all bricks: 1292 322 190
Wed Nov 30 14:02:57 IST 2016    Unsynced entries on all bricks: 1140 219 124
Wed Nov 30 14:03:28 IST 2016    Unsynced entries on all bricks: 1020 219 124
Wed Nov 30 14:03:54 IST 2016    Unsynced entries on all bricks: 896 219 124
Wed Nov 30 14:04:17 IST 2016    Unsynced entries on all bricks: 778 219 124
Wed Nov 30 14:04:38 IST 2016    Unsynced entries on all bricks: 680 219 124
Wed Nov 30 14:04:57 IST 2016    Unsynced entries on all bricks: 586 219 124
Wed Nov 30 14:05:16 IST 2016    Unsynced entries on all bricks: 503 219 124
Wed Nov 30 14:05:32 IST 2016    Unsynced entries on all bricks: 424 219 124
Wed Nov 30 14:05:45 IST 2016    Unsynced entries on all bricks: 360 219 124
Wed Nov 30 14:05:58 IST 2016    Unsynced entries on all bricks: 296 219 124
Wed Nov 30 14:06:10 IST 2016    Unsynced entries on all bricks: 240 219 124
Wed Nov 30 14:06:22 IST 2016    Unsynced entries on all bricks: 191 219 124
Wed Nov 30 14:06:32 IST 2016    Unsynced entries on all bricks: 140 219 124
Wed Nov 30 14:06:41 IST 2016    Unsynced entries on all bricks: 111 219 124
Wed Nov 30 14:06:46 IST 2016    Unsynced entries on all bricks: 111 219 124
Wed Nov 30 14:06:53 IST 2016    Unsynced entries on all bricks: 111 219 124
....
Wed Nov 30 17:06:10 IST 2016    Unsynced entries on all bricks: 111 219 124
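
A minimal sketch of a monitoring script that produces output of the above form,
assuming the per-brick counts are taken from the 'Number of entries' lines of
heal info; the awk parsing and the one-minute interval are assumptions, not the
script that was actually used:

#!/bin/bash
# Print a timestamp and the per-brick count of entries pending heal, once a minute.
VOL=vmvol
while true; do
    # One count per brick; a brick that is down may report no count (or '-').
    counts=$(gluster volume heal ${VOL} info \
             | awk '/^Number of entries:/ {printf "%s ", $NF}')
    echo "$(date)    Unsynced entries on all bricks: ${counts}"
    sleep 60
done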

--- Additional comment from Krutika Dhananjay on 2016-12-01 02:01:21 EST ---

The bug is consistently recreatable with shard and granular entry heal enabled:
do a simple dd of 1G into a file and, during this time, bring bricks offline
and online in cyclic order (all of this while heal is disabled). Enabling heal
at the end and monitoring heal-info shows that the same one entry always shows
up in heal-info.
I was not able to hit the bug without granular-entry-heal. This is possibly a
bug in the granular entry heal code.

-Krutika
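
For reference, a minimal shell sketch of this simplified reproduction, assuming
a fuse mount at /mnt/vmvol and the stock volume-set option names for disabling
shd-side and client-side heals; the file name and sizes are illustrative:

#!/bin/bash
VOL=vmvol
MNT=/mnt/vmvol

# Disable shd-side and client-side heals for the duration of the test.
gluster volume set ${VOL} cluster.self-heal-daemon off
gluster volume set ${VOL} cluster.data-self-heal off
gluster volume set ${VOL} cluster.metadata-self-heal off
gluster volume set ${VOL} cluster.entry-self-heal off

# Write 1G into a sharded file while bricks are killed and force-started
# one after the other (cyclic order), as in the earlier kill/force-start sketch.
dd if=/dev/urandom of=${MNT}/testfile bs=1M count=1024 &

# ... kill and force-start each brick in turn while dd is running ...
wait

# Re-enable heal and watch heal-info: the same entry keeps showing up.
gluster volume set ${VOL} cluster.self-heal-daemon on
watch -n 10 "gluster volume heal ${VOL} info | grep 'Number of entries'"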

--- Additional comment from Krutika Dhananjay on 2016-12-08 04:28:49 EST ---

First of all, great catch, Sas!

Now for the test done and the RCA.

TEST:
=====
I ran this same test by modifying the shard xl to accept a 1KB shard-block-size,
solely to increase the likelihood of hitting this bug, started a dd of 10G on
the volume and in the meantime brought bricks offline and online in cyclic
order (also with shd-side and client-side heal disabled).

After the third brick was brought offline and dd failed with EROFS/EIO, I
captured the xattrs of all the different shards created on all the replicas and
also the entries under
indices/entry-changes/be318638-e8a0-4c6d-977d-7a937aa84806 on all three
replicas.

Then I brought the third brick back up and enabled self-heal. As Sas pointed
out, after a while heal came to a halt. I captured the heal-info entries and
checked for their names in the captured entry-changes output.
For all of these entries, there was NO entry-changes index corresponding to
these entry names on the source bricks. As a result, these entries never got
healed in the first place. Because of this, even data and metadata heal didn't
proceed and so the heal-info count ceased to decrease at this point.
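
A sketch of this cross-check as it could be done on a source brick, assuming
heal-info reports the shards by path and using the brick backend path from the
volume info above; be318638-e8a0-4c6d-977d-7a937aa84806 is the GFID of the
/.shard directory, under which the shards' granular entry indices are recorded:

#!/bin/bash
# Run on a brick that is expected to act as a heal source.
VOL=vmvol
BRICK=/gluster/brick2/bk

# Shard names that heal-info reports as pending (they live under /.shard).
gluster volume heal ${VOL} info \
    | awk -F/ '/\/.shard\// {print $NF}' | sort -u > /tmp/pending-shards

# Granular entry indices recorded for /.shard on this brick.
ls ${BRICK}/.glusterfs/indices/entry-changes/be318638-e8a0-4c6d-977d-7a937aa84806/ \
    | sort -u > /tmp/indexed-names

# Shards pending heal that have NO corresponding entry index on this brick:
comm -23 /tmp/pending-shards /tmp/indexed-names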

RCA:
====
So the bug seems to be in AFR's transaction code (not in the granular-entry-heal
code itself, because this issue can happen even with non-granular indices in
entry txns, although it is easier to hit with granular-esh on) and is caused by
a case of skipped post-op.

In the code there is only one case where this can happen (which is due to a bug
of course).

At the start of the MKNOD txn of these affected shards, all three replicas were
up. This caused AFR to set local->optimistic_change_log to TRUE (which means
AFR can skip the pre-op phase; in the event there is a failure, it will have to
execute the post-op phase of the transaction before unlocking, to set the
pending markers). But before the locking phase, one of the replicas went down.

<Some-code-detail>
Because of this, local->transaction.pre_op[offline-brick-index] = 0
                 local->transaction.failed_subvols[offline-brick-index] = 1
</Some-code-detail>

AFR skips setting the dirty xattrs since local->optimistic_change_log is TRUE.
Then the MKNOD happens.
Then, when it comes to doing the post-op (afr_changelog_post_op_now()), AFR
relies on two parameters to decide whether the post-op is necessary or not:
1. whether the dirty xattr was set in pre-op, and
2. whether on any of the replicas the pre-op was done but the OP phase failed.

If 1 is true, post-op will be done to undo the dirty xattr setting on the
bricks.
If 2 is true, again post-op will be done.

In all other cases, post-op will be skipped, and it is post-op where
granular-entry indices are created.

But in the case described above, where the brick went down just before the
locking phase but was up at the time of txn parameter initialisation, neither
the pre-op nor the OP was performed on this brick.
Because of this, AFR would skip performing the post-op. This implies that the
granular index won't be set, and in the event granular entry heal is disabled,
the index pertaining to the parent directory in indices/xattrop will likewise
be skipped.

--- Additional comment from Worker Ant on 2016-12-08 12:33:50 EST ---

REVIEW: http://review.gluster.org/16075 (cluster/afr: Fix per-txn optimistic
changelog initialisation) posted (#1) for review on master by Krutika Dhananjay
(kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2016-12-09 01:59:09 EST ---

REVIEW: http://review.gluster.org/16075 (cluster/afr: Fix per-txn optimistic
changelog initialisation) posted (#2) for review on master by Krutika Dhananjay
(kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2016-12-12 04:41:08 EST ---

REVIEW: http://review.gluster.org/16075 (cluster/afr: Fix per-txn optimistic
changelog initialisation) posted (#3) for review on master by Krutika Dhananjay
(kdhananj at redhat.com)


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1400057
[Bug 1400057] self-heal not happening, as self-heal info lists the same
pending shards to be healed
https://bugzilla.redhat.com/show_bug.cgi?id=1402730
[Bug 1402730] self-heal not happening, as self-heal info lists the same
pending shards to be healed
https://bugzilla.redhat.com/show_bug.cgi?id=1403646
[Bug 1403646] self-heal not happening, as self-heal info lists the same
pending shards to be healed
-- 
You are receiving this mail because:
You are on the CC list for the bug.
Unsubscribe from this bug https://bugzilla.redhat.com/token.cgi?t=LENMczXzY8&a=cc_unsubscribe

