[Bugs] [Bug 1811284] New: Data loss during rebalance
bugzilla at redhat.com
Sat Mar 7 07:42:53 UTC 2020
https://bugzilla.redhat.com/show_bug.cgi?id=1811284
Bug ID: 1811284
Summary: Data loss during rebalance
Product: GlusterFS
Version: 6
Hardware: x86_64
OS: Linux
Status: NEW
Component: distribute
Severity: urgent
Assignee: bugs at gluster.org
Reporter: kompastver at gmail.com
CC: bugs at gluster.org
Target Milestone: ---
Classification: Community
Created attachment 1668276
--> https://bugzilla.redhat.com/attachment.cgi?id=1668276&action=edit
gluster volume get vol1 all
Description of problem:
We have quite a large GlusterFS cluster with 37 billion files and 75 TB of used
space.
All of these files were stored on a distributed-replicated cluster of six
servers running GlusterFS v5.5.
At some point, we decided to expand the cluster and ordered another six
servers.
After a smoke test of GlusterFS v6.8 in our staging environment (which included
expanding a cluster), we upgraded production (op.version still configured for
v5.5). Several days later we started expanding the cluster. For a couple of
days everything looked fine, but then we found that some files were missing.
We stopped the rebalance immediately; at that moment, progress was around 10%.
The only application that uses this storage only adds new files and never
deletes them, and the only things that changed in the preceding days were the
rebalance process we had started and the new version of GlusterFS. So the main
suspect is GlusterFS.
After checking the files, we found that about 2% of them were lost. We can
restore some of them from a backup made before the upgrade and rebalance, but
some of them are lost forever.
Unfortunately, there are no relevant errors in the log files and no core dump
files on any storage node.
How can I help find the cause of the data loss?
Version-Release number of selected component (if applicable):
6.8
How reproducible:
Not reproducible in our staging environment, but it reproduces in production
Steps to Reproduce:
1. create a distributed-replicated cluster with replication factor 2
2. mount the volume and copy files to it
3. add new servers to the pool: gluster peer probe ..
4. expand the cluster: gluster volume add-brick my-vol srv6:/br srv7:/br
5. invoke rebalance: gluster volume rebalance my-vol start
6. check that all files exist
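The final step (checking that all files exist) can be made repeatable with a
small script. What follows is only a minimal sketch, assuming a manifest of
relative paths was recorded on the client mount before the rebalance started.
The function names list_files, missing_files and loss_percent are hypothetical
helpers for illustration and are not part of GlusterFS.

```python
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def missing_files(manifest, root):
    """Return the manifest entries that no longer exist under root, sorted."""
    return sorted(
        rel for rel in manifest
        if not os.path.lexists(os.path.join(root, rel))
    )


def loss_percent(manifest, missing):
    """Return the share of manifest entries that are missing, in percent."""
    if not manifest:
        return 0.0
    return 100.0 * len(missing) / len(manifest)
```

One would call list_files on the mount point before the rebalance and
missing_files with the same mount point afterwards. Holding every path in
memory is only practical for a test volume; for a volume with billions of
files the manifest would have to be streamed from disk or sampled.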
Actual results:
some files disappeared
Expected results:
all files exist
Additional info:
~ $ sudo gluster volume info
Volume Name: vol1
Type: Distributed-Replicate
Volume ID: fb35a90e-5174-466e-bb66-39391e8e83b9
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: srv1:/vol3/brick
Brick2: srv2:/vol3/brick
Brick3: srv3:/vol3/brick
Brick4: srv4:/vol3/brick
Brick5: srv5:/vol3/brick
Brick6: srv6:/vol3/brick
Brick7: srv7:/var/lib/gluster-bricks/brick
Brick8: srv8:/var/lib/gluster-bricks/brick
Brick9: srv9:/var/lib/gluster-bricks/brick
Brick10: srv10:/var/lib/gluster-bricks/brick
Brick11: srv11:/var/lib/gluster-bricks/brick
Brick12: srv12:/var/lib/gluster-bricks/brick
Options Reconfigured:
cluster.self-heal-daemon: enable
cluster.rebal-throttle: normal
performance.readdir-ahead: off
transport.address-family: inet6
performance.io-thread-count: 64
nfs.disable: on
performance.io-cache: on
performance.quick-read: off
performance.parallel-readdir: on
performance.client-io-threads: off
features.sdfs: enable
performance.read-ahead: off
client.event-threads: 4
server.event-threads: 32
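For anyone checking the bricks directly: in a replicated volume, consecutive
bricks in the listed order form a replica set, so "6 x 2 = 12" above means six
pairs (Brick1 with Brick2, Brick3 with Brick4, and so on). A minimal sketch
that derives those pairs from the text of "gluster volume info" is shown here.
The helper names parse_bricks and replica_sets are hypothetical, and the
replica count is passed in by hand rather than parsed.

```python
import re

# Matches lines such as "Brick7: srv7:/var/lib/gluster-bricks/brick".
# The "Bricks:" header does not match because a brick number is required.
BRICK_RE = re.compile(r"^Brick(\d+):\s*(\S+)\s*$")


def parse_bricks(volume_info_text):
    """Return brick specs (host:/path) ordered by their brick number."""
    numbered = []
    for line in volume_info_text.splitlines():
        match = BRICK_RE.match(line.strip())
        if match:
            numbered.append((int(match.group(1)), match.group(2)))
    return [spec for _number, spec in sorted(numbered)]


def replica_sets(bricks, replica_count):
    """Group consecutive bricks into replica sets of replica_count bricks."""
    if replica_count <= 0 or len(bricks) % replica_count != 0:
        raise ValueError("brick count must be a multiple of the replica count")
    return [
        bricks[i:i + replica_count]
        for i in range(0, len(bricks), replica_count)
    ]
```

With the pairs known, a file that is missing on the mount can be looked up on
both bricks of each pair to see whether any copy still exists on disk.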
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.