[Bugs] [Bug 1566820] New: [Remove-brick] Many files were not migrated from the decommissioned bricks; commit results in data loss
bugzilla at redhat.com
Fri Apr 13 04:13:51 UTC 2018
https://bugzilla.redhat.com/show_bug.cgi?id=1566820
Bug ID: 1566820
Summary: [Remove-brick] Many files were not migrated from the
decommissioned bricks; commit results in data loss
Product: GlusterFS
Version: 3.12
Component: distribute
Severity: high
Priority: high
Assignee: bugs at gluster.org
Reporter: nbalacha at redhat.com
CC: bugs at gluster.org, rhs-bugs at redhat.com,
storage-qa-internal at redhat.com, tdesala at redhat.com
Depends On: 1553677, 1564198
+++ This bug was initially created as a clone of Bug #1564198 +++
+++ This bug was initially created as a clone of Bug #1553677 +++
Description of problem:
=======================
Many files were not migrated from the decommissioned bricks; commit results in
data loss.
Version-Release number of selected component (if applicable):
3.12.2-5.el7rhgs.x86_64
How reproducible:
Reporting at first occurrence
Steps to Reproduce:
===================
1) Create a x3 volume with brick-mux enabled and start it.
2) FUSE mount it on multiple clients.
3) From client-1: run a script that creates folders and files continuously
   From client-2: start a Linux kernel untar
   From client-3: while true;do find;done
   From client-4: while true;do ls -lRt;done
4) While step 3 is in progress, kill the server-1 brick process using
kill -9 <pid>.
Because brick mux is enabled, killing a single brick on the server with
kill -9 takes down all the bricks on that node.
5) Now add 3 bricks to the volume and, after a few seconds, start removing
the old bricks.
6) Wait for remove-brick to complete.
Actual results:
===============
Many files were not migrated from the decommissioned bricks; commit results in
data loss.
Expected results:
=================
Remove-brick operation should migrate all the files from the decommissioned
brick.
RCA:
The logs from the previous failed runs indicate two problems:
1. At least one process could not read directories because the first_up_subvol
was not in the list of local_subvols for that process.
2. Since a brick was down, files whose gfid hashed to that brick's node-uuid
were not migrated.
--- Additional comment from Worker Ant on 2018-04-05 12:17:33 EDT ---
REVIEW: https://review.gluster.org/19827 (cluster/dht: Wind open to all
subvols) posted (#1) for review on master by N Balachandran
--- Additional comment from Worker Ant on 2018-04-06 06:46:22 EDT ---
REVIEW: https://review.gluster.org/19831 (cluster/dht: Handle file migrations
when brick down) posted (#1) for review on master by N Balachandran
--- Additional comment from Worker Ant on 2018-04-11 09:19:03 EDT ---
COMMIT: https://review.gluster.org/19827 committed in master by "Shyamsundar
Ranganathan" <srangana at redhat.com> with a commit message- cluster/dht: Wind
open to all subvols
dht_opendir should wind the open to all subvols
whether or not local->subvols is set. This is
because dht_readdirp winds the calls to all subvols.
Change-Id: I67a96b06dad14a08967c3721301e88555aa01017
updates: bz#1564198
Signed-off-by: N Balachandran <nbalacha at redhat.com>
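The mismatch described in this commit message can be illustrated with a
toy model. This is only a sketch and not the DHT code: the function names,
the wind_to_all flag, and the dictionary used as a directory listing are all
hypothetical, and a subvolume that was never opened simply contributes no
entries here, which may differ from the real failure mode.

```python
from typing import Dict, Iterable, List, Optional, Set


def opendir(all_subvols: Iterable[str],
            local_subvols: Optional[Iterable[str]],
            wind_to_all: bool) -> Set[str]:
    """Return the set of subvolumes on which the directory was opened."""
    # Fixed behaviour (wind_to_all=True): open on every subvolume,
    # whether or not a list of local subvolumes is set.
    if wind_to_all or not local_subvols:
        return set(all_subvols)
    # Old behaviour in this model: open only on the local subvolumes.
    return set(local_subvols)


def readdirp(listing: Dict[str, List[str]], opened: Set[str]) -> List[str]:
    """Visit every subvolume, as readdirp does, and collect entries.

    Toy assumption: a subvolume without an open descriptor yields nothing.
    """
    entries: List[str] = []
    for subvol, names in listing.items():
        if subvol in opened:
            entries.extend(names)
    return entries
```

With three subvolumes and only one of them local, the old behaviour in this
model lists a single subvolume's entries, while winding the open to all
subvolumes lets readdirp see every entry.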
--- Additional comment from Worker Ant on 2018-04-12 22:27:57 EDT ---
COMMIT: https://review.gluster.org/19831 committed in master by "Raghavendra G"
<rgowdapp at redhat.com> with a commit message- cluster/dht: Handle file
migrations when brick down
The decision as to which node would migrate a file
was based on the gfid of the file. Files were divided
among the nodes for the replica/disperse set. However,
if a brick was down when rebalance started, the nodeuuids
would be saved as NULL and a set of files would not be migrated.
Now, if the nodeuuid is NULL, the first non-null entry in
the set is the node responsible for migrating the file.
Change-Id: I72554c107792c7d534e0f25640654b6f8417d373
fixes: bz#1564198
Signed-off-by: N Balachandran <nbalacha at redhat.com>
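The selection rule in this commit message can be sketched as follows. This
is a minimal illustration, not the rebalance code: the function names are
hypothetical, node uuids are plain strings with None standing in for a NULL
nodeuuid, and the modulo on the gfid is a stand-in for whatever hash the
real implementation uses.

```python
import uuid
from typing import Optional, Sequence


def responsible_node(gfid: uuid.UUID,
                     node_uuids: Sequence[Optional[str]]) -> Optional[str]:
    """Return the node that should migrate the file, or None if none is known."""
    if not node_uuids:
        return None
    # Stand-in hash (assumption): spread files over the set by gfid.
    index = gfid.int % len(node_uuids)
    chosen = node_uuids[index]
    if chosen is not None:
        return chosen
    # The slot belongs to a brick that was down when rebalance started.
    # Per the commit message, fall back to the first non-null entry.
    for candidate in node_uuids:
        if candidate is not None:
            return candidate
    return None


def should_migrate(gfid: uuid.UUID,
                   node_uuids: Sequence[Optional[str]],
                   my_uuid: str) -> bool:
    """True if the node identified by my_uuid is responsible for this file."""
    return responsible_node(gfid, node_uuids) == my_uuid
```

Without the fallback loop, a file whose gfid maps to a None slot would have
no responsible node and would be left on the decommissioned brick, which
matches the second problem listed in the RCA.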
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1553677
[Bug 1553677] [Remove-brick] Many files were not migrated from the
decommissioned bricks; commit results in data loss
https://bugzilla.redhat.com/show_bug.cgi?id=1564198
[Bug 1564198] [Remove-brick] Many files were not migrated from the
decommissioned bricks; commit results in data loss