[Bugs] [Bug 1566822] New: [Remove-brick] Many files were not migrated from the decommissioned bricks; commit results in data loss

bugzilla at redhat.com
Fri Apr 13 04:14:29 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1566822

            Bug ID: 1566822
           Summary: [Remove-brick] Many files were not migrated from the
                    decommissioned bricks; commit results in data loss
           Product: GlusterFS
           Version: 4.0
         Component: distribute
          Severity: high
          Priority: high
          Assignee: bugs at gluster.org
          Reporter: nbalacha at redhat.com
                CC: bugs at gluster.org, rhs-bugs at redhat.com,
                    storage-qa-internal at redhat.com, tdesala at redhat.com
        Depends On: 1553677, 1564198
            Blocks: 1566820



+++ This bug was initially created as a clone of Bug #1564198 +++

+++ This bug was initially created as a clone of Bug #1553677 +++

Description of problem:
=======================
Many files were not migrated from the decommissioned bricks; commit results in
data loss.

Version-Release number of selected component (if applicable):
3.12.2-5.el7rhgs.x86_64

How reproducible:
Reporting at first occurrence

Steps to Reproduce:
===================
1) Create an x3 volume with brick-mux enabled and start it.
2) FUSE-mount it on multiple clients.
3) From client-1: run a script that creates folders and files continuously.
   From client-2: start a Linux kernel untar.
   From client-3: while true; do find; done
   From client-4: while true; do ls -lRt; done
4) While step 3 is in progress, kill the server-1 brick process using
   kill -9 <pid>.
   Because brick-mux is enabled, killing a single brick on the server with
   kill -9 takes down all the bricks on that node.
5) Add 3 bricks to the volume and, a few seconds later, start removing the
   old bricks.
6) Wait for remove-brick to complete.
6) Wait for remove-brick to complete.

Actual results:
===============
Many files were not migrated from the decommissioned bricks; commit results in
data loss.

Expected results:
=================
The remove-brick operation should migrate all files from the decommissioned
bricks.

RCA:

The logs from the previous failed runs indicate 2 problems:

1. At least one process could not read directories because the first_up_subvol
was not in the list of local_subvols for that process.
2. Because a brick was down, files whose gfid hashed to that brick's node-uuid
were not migrated.

--- Additional comment from Worker Ant on 2018-04-05 12:17:33 EDT ---

REVIEW: https://review.gluster.org/19827 (cluster/dht: Wind open to all
subvols) posted (#1) for review on master by N Balachandran

--- Additional comment from Worker Ant on 2018-04-06 06:46:22 EDT ---

REVIEW: https://review.gluster.org/19831 (cluster/dht: Handle file migrations
when brick down) posted (#1) for review on master by N Balachandran

--- Additional comment from Worker Ant on 2018-04-11 09:19:03 EDT ---

COMMIT: https://review.gluster.org/19827 committed in master by "Shyamsundar
Ranganathan" <srangana at redhat.com> with a commit message- cluster/dht: Wind
open to all subvols

dht_opendir should wind the open to all subvols
whether or not local->subvols is set. This is
because dht_readdirp winds the calls to all subvols.

Change-Id: I67a96b06dad14a08967c3721301e88555aa01017
updates: bz#1564198
Signed-off-by: N Balachandran <nbalacha at redhat.com>
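
The mismatch this commit addresses can be illustrated with a small toy model. This is only a sketch and not the GlusterFS code: the Subvol class, the opendir and readdirp helpers, and the wind_to_all flag are all hypothetical names made up for illustration. It assumes that reading a directory on a subvolume fails unless the directory was opened there first.

```python
# Toy model only: Subvol, opendir, readdirp and wind_to_all are hypothetical
# names, not GlusterFS APIs.

class Subvol:
    def __init__(self, name, entries):
        self.name = name
        self.entries = list(entries)
        self.open_dirs = set()

    def opendir(self, path):
        self.open_dirs.add(path)

    def readdirp(self, path):
        # Assumption: a read on a directory that was never opened here fails.
        if path not in self.open_dirs:
            raise OSError(f"{self.name}: {path} was never opened")
        return list(self.entries)


def opendir(subvols, path, local_subvols=None, wind_to_all=True):
    """Open path on every subvol, or only on local_subvols (old behaviour)."""
    if wind_to_all or not local_subvols:
        targets = subvols
    else:
        targets = local_subvols
    for subvol in targets:
        subvol.opendir(path)


def readdirp(subvols, path):
    """Read path from every subvol, mirroring a readdirp that winds to all."""
    names = []
    for subvol in subvols:
        names.extend(subvol.readdirp(path))
    return names
```

With wind_to_all=False and a local_subvols list that omits one subvolume, readdirp raises on the subvolume that was skipped; with the default, every subvolume has the directory open and the listing succeeds.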

--- Additional comment from Worker Ant on 2018-04-12 22:27:57 EDT ---

COMMIT: https://review.gluster.org/19831 committed in master by "Raghavendra G"
<rgowdapp at redhat.com> with a commit message- cluster/dht: Handle file
migrations when brick down

The decision as to which node would migrate a file
was based on the gfid of the file. Files were divided
among the nodes for the replica/disperse set. However,
if a brick was down when rebalance started, the nodeuuids
would be saved as NULL and a set of files would not be migrated.

Now, if the nodeuuid is NULL, the first non-null entry in
the set is the node responsible for migrating the file.

Change-Id: I72554c107792c7d534e0f25640654b6f8417d373
fixes: bz#1564198
Signed-off-by: N Balachandran <nbalacha at redhat.com>
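
The selection rule described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual DHT rebalance code: the function names are hypothetical, and reducing the gfid to an index with a plain modulo is an assumption, since the commit message does not spell out the hashing. A None entry stands for a nodeuuid that was saved as NULL because its brick was down.

```python
import uuid
from typing import Optional, Sequence


def responsible_node(gfid: uuid.UUID,
                     node_uuids: Sequence[Optional[str]]) -> Optional[str]:
    """Pick the node that should migrate the file identified by gfid.

    node_uuids holds one entry per brick in the replica/disperse set;
    None models a nodeuuid saved as NULL because the brick was down.
    """
    if not node_uuids:
        return None
    # Assumption: the gfid is mapped to a slot by a simple modulo.
    index = gfid.int % len(node_uuids)
    chosen = node_uuids[index]
    if chosen is not None:
        return chosen
    # Fallback from the commit message: the first non-null entry in the set
    # becomes responsible for the file.
    for node in node_uuids:
        if node is not None:
            return node
    return None


def should_migrate(local_node: str, gfid: uuid.UUID,
                   node_uuids: Sequence[Optional[str]]) -> bool:
    """True if the local node is the one responsible for this file."""
    return responsible_node(gfid, node_uuids) == local_node
```

Without the fallback, a file whose gfid maps to a None slot is claimed by no node, which matches the skipped files described above. With it, every file has an owner as long as at least one entry in the set is non-null.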


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1553677
[Bug 1553677] [Remove-brick] Many files were not migrated from the
decommissioned bricks; commit results in data loss
https://bugzilla.redhat.com/show_bug.cgi?id=1564198
[Bug 1564198] [Remove-brick] Many files were not migrated from the
decommissioned bricks; commit results in data loss
https://bugzilla.redhat.com/show_bug.cgi?id=1566820
[Bug 1566820] [Remove-brick] Many files were not migrated from the
decommissioned bricks; commit results in data loss