[Bugs] [Bug 1440635] New: Application VMs with their disk images on sharded-replica 3 volume are unable to boot after performing rebalance

bugzilla at redhat.com bugzilla at redhat.com
Mon Apr 10 07:26:19 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1440635

            Bug ID: 1440635
           Summary: Application VMs with their disk images on
                    sharded-replica 3 volume are unable to boot after
                    performing rebalance
           Product: GlusterFS
           Version: 3.8
         Component: distribute
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: kdhananj at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    kdhananj at redhat.com, knarra at redhat.com,
                    rcyriac at redhat.com, rgowdapp at redhat.com,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    sasundar at redhat.com, storage-qa-internal at redhat.com
        Depends On: 1440051



+++ This bug was initially created as a clone of Bug #1440051 +++

+++ This bug was initially created as a clone of Bug #1439753 +++

+++ This bug was initially created as a clone of Bug #1434653 +++

Description of problem:
-----------------------
5 VM disk images are created on the fuse-mounted sharded replica 3 volume of
type 1x3. 5 VMs are installed, rebooted and are up. 3 more bricks are added to
this volume to make it 2x3. After performing rebalance, some weird errors were
observed that prevented logging in to these VMs. When these VMs were rebooted,
they were unable to boot, which means that the VM disks are corrupted.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

glusterfs-3.8.10

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a sharded replica 3 volume
2. Optimize the volume for virt store usecase ( gluster volume set <vol> group
virt ) and start the volume
3. Fuse mount the volume on another RHEL 7.3 Server ( used as hypervisor )
4. Create few disk images of size 10GB each
5. Start the VMs, install OS (RHEL 7.3) and reboot
6. When the VMs are up post installation, add 3 more bricks to the volume
7. Start rebalance process

Actual results:
---------------
VMs showed some errors on the console, which prevented logging in.
Post rebalance, when the VMs are rebooted, they are unable to boot, with the
boot prompt showing messages related to XFS inode corruption.

Expected results:
-----------------
VM disks should not get corrupted.

--- Additional comment from SATHEESARAN on 2017-03-21 23:20:28 EDT ---

Setup Information
------------------


3. Volume info
--------------
# gluster volume info

Volume Name: trappist1
Type: Distributed-Replicate
Volume ID: 30e12835-0c21-4037-9f83-5556f3c637b6
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: server1:/gluster/brick1/b1
Brick2: server2:/gluster/brick1/b1
Brick3: server3:/gluster/brick1/b1
Brick4: server3:/gluster/brick2/b2 --> new brick added
Brick5: server1:/gluster/brick2/b2 --> new brick added
Brick6: server2:/gluster/brick2/b2 --> new brick added
Options Reconfigured:
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

4. sharding related info
-------------------------
sharding is enabled on this volume with shard-block-size set to 4MB ( which is
default )
[ granular-entry-heal enabled ]

5. Hypervisor details
----------------------
Host: rhs-client15.lab.eng.blr.redhat.com
mountpoint: /mnt/repvol

6.Virtual machine details
--------------------------
There are 5 virtual machines running on this host, namely vm1, vm2, vm3, vm4
and vm5, with their disk images on the fuse-mounted gluster volume

[root at rhs-client15 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 6     vm1                            running
 7     vm2                            running
 8     vm3                            running
 9     vm4                            running
 10    vm5                            running


I have tested again with all the application VMs powered off. All VMs booted
up healthy. The following are the test steps:

1. Created a sharded replica 3 volume and optimized the volume for the virt
store usecase
2. Created 5 VM image files on the fuse mounted gluster volume
3. Created 5 Application VMs with the above created VM images and installed OS
( RHEL7.3 ). Rebooted the VMs post OS installation.
4. Checked the health of all the VMs ( all VMs are healthy )
5. Powered off all the application VMs
6. Added 3 more bricks to convert 1x3 replicate volume to 2x3
distribute-replicate volume 
7. Initiated rebalance
8. After rebalance completed, started all the VMs. ( All VMs booted up
healthy )

So, it's the running VMs that are getting affected by the rebalance
operation.

--- Additional comment from Raghavendra G on 2017-03-26 21:27:18 EDT ---

Conversation over mail:

> ​Raghu,
>
> In one of my test iteration, fix-layout itself caused corruption with VM
> disk.
> It happened only once, when I tried twice after that it never happened

One test is good enough to prove that we are dealing with at least one
corruption issue that is not the same as bz 1376757.

We need more analysis to figure out RCA.

>
> Thanks,
> Satheesaran S ( sas )​

--- Additional comment from SATHEESARAN on 2017-03-27 04:12:12 EDT ---

I have run the test with the following combination:
- strict-o-direct turned off and remote-dio enabled
I could still observe that VM disks are getting corrupted.

Also, in another test with sharding turned off, this issue was not seen.

--- Additional comment from Nithya Balachandran on 2017-03-29 23:28:52 EDT ---

Hi,

Is the system on which the issue was hit still available?

Thanks,
Nithya

--- Additional comment from Raghavendra G on 2017-04-01 00:52:31 EDT ---

Following is a rough algorithm of shard_writev:

1. Based on the offset, calculate the shards touched by current write.
2. Look for inodes corresponding to these shard files in itable.
3. If one or more inodes are missing from itable, issue mknod for corresponding
shard files and ignore EEXIST in cbk.
4. Resume writes on the respective shards.
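
The rough algorithm above can be sketched as follows. This is a minimal,
hypothetical Python model, not GlusterFS code; `itable` and `on_disk` are
illustrative stand-ins for the inode table and the backend, and the 4MB
shard-block-size is the volume's default from the setup information above.

```python
SHARD_SIZE = 4 * 1024 * 1024  # default shard-block-size of 4MB

def shards_touched(offset, length, shard_size=SHARD_SIZE):
    """Step 1: shard indices covered by a write at (offset, length)."""
    first = offset // shard_size
    last = (offset + length - 1) // shard_size
    return list(range(first, last + 1))

def shard_writev(itable, on_disk, base_gfid, offset, length):
    """Steps 2-4: resolve shard inodes, mknod the missing ones, write."""
    names = ["%s.%d" % (base_gfid, i) for i in shards_touched(offset, length)]
    for name in names:
        if name not in itable:   # step 2/3: inode missing from itable
            on_disk.add(name)    # mknod; EEXIST (already present) is ignored
            itable.add(name)
    return names                 # step 4: writes resume on these shards
```

The key point for the bug that follows is step 3: the mknod is issued
without first looking the shard up, trusting EEXIST to catch existing
shards on whichever subvolume the name currently hashes to.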

Now, imagine a write which falls on an existing "shard_file". For the sake of
discussion, let's consider a distribute of three subvols - s1, s2, s3

1. "shard_file" hashes to subvolume s2 and is present on s2
2. add a subvolume s4 and initiate a fix layout. The layout of ".shard" is
fixed to include s4 and hash ranges are changed.
3. write that touches "shard_file" is issued.
4. The inode for "shard_file" is not present in itable after a graph switch and
features/shard issues an mknod.
5. With the new layout of .shard, let's say "shard_file" hashes to s3 and
mknod (shard_file) on s3 succeeds. But the shard_file is already present on s2.

So, we have two files on two different subvols of dht representing the same
shard, and this will lead to corruption.

To prove the above hypothesis we need to look for one or more files (say
"shard_file") in .shard present in more than one subvolume of dht. IOW, more
than one subvolume of dht should have the file "/.shard/shard_file".
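
The check above can be done mechanically. A minimal sketch, assuming one
brick path per dht subvolume is supplied (paths and the helper name are
illustrative; in practice you would pick one brick from each replica set,
since every shard legitimately exists on all three bricks of its own set):

```python
import collections
import os

def duplicate_shards(shard_dirs):
    """Return shard file names present in more than one subvolume's
    .shard directory - i.e. candidates for the corruption described."""
    counts = collections.Counter()
    for d in shard_dirs:
        for name in os.listdir(d):
            counts[name] += 1
    return sorted(name for name, c in counts.items() if c > 1)
```

Any name this returns corresponds to a shard with two on-disk copies whose
contents may diverge, which is exactly what the next comment confirms.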

@Sas,

Is the setup still available? If yes, can you please take a look? Or if you can
give me login details, I'll take a look. If the setup is not available, can you
recreate the issue one more time so that I can take a look?

regards,
Raghavendra

--- Additional comment from Krutika Dhananjay on 2017-04-03 04:13:44 EDT ---

Whatever Raghavendra suspected in comment #12 is what we observed on sas' setup
just now.

Following are the duplicate shards that exist on both subvolumes of DHT:

[root at dhcp37-65 tmp]# cat /tmp/shards-replicate-1 | sort | uniq -c | grep -v "1 "
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.1397
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.1864
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.487
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.552
      2 702cd056-84d5-4c83-9232-cca363f2b3a7.7
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.487
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.509
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.521
      2 7a56bb45-91a0-49f4-a983-a8a46c418e04.7
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.1397
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.1398
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.576
      2 a37ab9d5-2f18-4916-9315-52476bd7ff54.7
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1397
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1398
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1867
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.1868
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.2
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.487
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.552
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.576
      2 deaddfc9-f95e-4c26-9d4b-82d96e95057a.941
      2 ede69d31-f048-41b7-9173-448c7046d537.1397
      2 ede69d31-f048-41b7-9173-448c7046d537.1398
      2 ede69d31-f048-41b7-9173-448c7046d537.487
      2 ede69d31-f048-41b7-9173-448c7046d537.552
      2 ede69d31-f048-41b7-9173-448c7046d537.576
      2 ede69d31-f048-41b7-9173-448c7046d537.7

Worse yet, the md5sums of the two copies differ.

For instance,

On replicate-0:
[root at dhcp37-65 tmp]# md5sum
/gluster/brick1/b1/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397
1e86d0a097c724965413d07af71c0809 
/gluster/brick1/b1/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397

On replicate-1:
[root at dhcp37-85 tmp]# md5sum
/gluster/brick2/b2/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397
e72cc949c7ba9b76d350a77be932ba3f 
/gluster/brick2/b2/.shard/702cd056-84d5-4c83-9232-cca363f2b3a7.1397

Raghavendra will be sending out a fix in DHT for this issue.

--- Additional comment from SATHEESARAN on 2017-04-03 04:23:01 EDT ---

(In reply to Raghavendra G from comment #12)

> @Sas,
> 
> Is the setup still available? If yes, can you please take a look? Or if you
> can give me login details, I'll take a look. If the setup is not available,
> can you recreate the issue one more time so that I can take a look?
> 
> regards,
> Raghavendra

I have already shared the setup details in the mail.
Let me know, if you need anything more

--- Additional comment from Raghavendra G on 2017-04-04 00:44:56 EDT ---

The fix itself is fairly simple:

In all entry fops - create, mknod, symlink, open with O_CREAT, link, rename,
mkdir, etc. - we have to do:

Check volume commit hash is equal to the commit hash on parent inode
1. If yes, proceed with the dentry fop
2. else, 
   a. initiate a lookup(frame, this, loc). IOW, Wind the lookup on the location
structure passed as an arg to DHT (not directly to its subvols)
   b. Once all lookups initiated in "a." are complete, resume the dentry fop.

For the scope of this bug it is sufficient to fix dht_mknod. But for
completeness' sake (and to avoid similar bugs in other codepaths [2]) I would
prefer to fix all codepaths. So, more codepaths are affected, and hence more
testing is needed.

[1] is another VM corruption issue during rebalance.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1276062
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1286127
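
The proposed DHT-side check can be sketched as follows. This is a hedged
model of the two-step algorithm above, not actual DHT code: `volume`,
`parent`, and the callbacks are illustrative stand-ins for the volume commit
hash, the parent inode's cached commit hash, and the wound operations.

```python
def dht_entry_fop(volume, parent, do_fop, do_lookup):
    """Gate an entry fop on the parent's layout being current."""
    if parent["commit_hash"] == volume["commit_hash"]:
        return do_fop()            # 1. hashes match: proceed with the fop
    do_lookup(parent)              # 2a. refresh the layout via lookup
    parent["commit_hash"] = volume["commit_hash"]
    return do_fop()                # 2b. resume the dentry fop
```

After a fix-layout bumps the volume commit hash, the first entry fop under a
stale parent triggers a lookup, so the name is created on its correct
(current) hashed subvolume instead of a second copy appearing elsewhere.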

--- Additional comment from Worker Ant on 2017-04-07 04:36:38 EDT ---

REVIEW: https://review.gluster.org/17010 (features/shard: Fix vm corruption
upon fix-layout) posted (#1) for review on master by Krutika Dhananjay
(kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2017-04-07 14:52:03 EDT ---

COMMIT: https://review.gluster.org/17010 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com) 
------
commit 99c8c0b03a3368d81756440ab48091e1f2430a5f
Author: Krutika Dhananjay <kdhananj at redhat.com>
Date:   Thu Apr 6 18:10:41 2017 +0530

    features/shard: Fix vm corruption upon fix-layout

    shard's writev implementation, as part of identifying
    presence of participant shards that aren't in memory,
    first sends an MKNOD on these shards, and upon EEXIST error,
    looks up the shards before proceeding with the writes.

    The VM corruption was caused when the following happened:
    1. DHT had n subvolumes initially.
    2. Upon add-brick + fix-layout, the layout of .shard changed
       although the existing shards under it were yet to be migrated
       to their new hashed subvolumes.
    3. During this time, there were writes on the VM falling in regions
       of the file whose corresponding shards were already existing under
       .shard.
    4. Sharding xl sent MKNOD on these shards, now creating them in their
       new hashed subvolumes although there already exist shard blocks for
       this region with valid data.
    5. All subsequent writes were wound on these newly created copies.

    The net outcome is that both copies of the shard didn't have the correct
    data. This caused the affected VMs to be unbootable.

    FIX:
    For want of better alternatives in DHT, the fix changes shard fops to do
    a LOOKUP before the MKNOD and upon EEXIST error, perform another lookup.

    Change-Id: I8a2e97d91ba3275fbc7174a008c7234fa5295d36
    BUG: 1440051
    RCA'd-by: Raghavendra Gowdappa <rgowdapp at redhat.com>
    Reported-by: Mahdi Adnan <mahdi.adnan at outlook.com>
    Signed-off-by: Krutika Dhananjay <kdhananj at redhat.com>
    Reviewed-on: https://review.gluster.org/17010
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
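
The shard-side fix described in the commit message can be sketched like
this. It is a simplified model, not the committed C code: `lookup` and
`mknod` are illustrative callbacks, with mknod returning "EEXIST" to stand
in for the error case.

```python
def resolve_shard(lookup, mknod, name):
    """LOOKUP before MKNOD, and LOOKUP again on EEXIST."""
    inode = lookup(name)      # fix: a prior LOOKUP finds a shard that
    if inode is not None:     # already exists on any subvolume
        return inode
    inode = mknod(name)       # genuinely new shard: create it
    if inode == "EEXIST":     # raced with another client: look up again
        inode = lookup(name)
    return inode
```

With this ordering, a write landing on an already-existing shard after a
fix-layout resolves the existing copy instead of minting a duplicate on the
new hashed subvolume.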

--- Additional comment from Worker Ant on 2017-04-10 02:00:02 EDT ---

REVIEW: https://review.gluster.org/17014 (features/shard: Initialize local->fop
in readv) posted (#1) for review on master by Krutika Dhananjay
(kdhananj at redhat.com)

--- Additional comment from Worker Ant on 2017-04-10 02:56:41 EDT ---

REVIEW: https://review.gluster.org/17014 (features/shard: Initialize local->fop
in readv) posted (#2) for review on master by Krutika Dhananjay
(kdhananj at redhat.com)


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1440051
[Bug 1440051] Application VMs with their disk images on sharded-replica 3
volume are unable to boot after performing rebalance
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

