[Bugs] [Bug 1811284] New: Data loss during rebalance

bugzilla at redhat.com
Sat Mar 7 07:42:53 UTC 2020


https://bugzilla.redhat.com/show_bug.cgi?id=1811284

            Bug ID: 1811284
           Summary: Data loss during rebalance
           Product: GlusterFS
           Version: 6
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: distribute
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: kompastver at gmail.com
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Created attachment 1668276
  --> https://bugzilla.redhat.com/attachment.cgi?id=1668276&action=edit
gluster volume get vol1 all

Description of problem:
We have quite a big glusterfs cluster with 37 billion files and 75 TB of used
space.
All these files were stored on a distributed-replicated volume across six
servers running glusterfs v5.5.
At some point, we decided to expand our cluster and ordered another six
servers.
After a smoke test of glusterfs v6.8 on our staging environment (which
included expanding a cluster), we upgraded our production cluster (op-version
still set to v5.5). After several days we started expanding the cluster, and
for a couple of days everything looked fine, but then we found out that some
files were absent. We stopped the rebalance immediately; at that moment, its
progress was around 10%.
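For reference, checking progress and stopping a rebalance use the standard
gluster CLI; a minimal sketch, using the volume name vol1 from the info below:

~ $ sudo gluster volume rebalance vol1 status
# per-node counts of scanned / rebalanced / failed files
~ $ sudo gluster volume rebalance vol1 stop
# stops migration; files already moved stay on their new bricks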
The only application that uses this storage only adds new files and never
deletes them, and the only things that changed in the last days are the
started rebalance process and the new version of glusterfs. So the main
suspect is glusterfs.
After checking the files, we found out that about 2% of them were lost. We can
restore some of them from a backup made before the upgrade and rebalance, but
the rest are lost forever.
Unfortunately, there are no relevant errors in the log files and no core dump
files on any storage node.
How can I help to find the cause of the data loss?
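For the file check mentioned above, a minimal sketch of one way to diff a
pre-rebalance file list against the live mount (the manifest path and the
mount point are hypothetical):

~ $ find /mnt/vol1 -type f | sort > /tmp/files.now
# hypothetical file list captured before the upgrade and rebalance
~ $ sort /backup/manifest.before > /tmp/files.before
# lines present before but missing now, i.e. the lost files
~ $ comm -23 /tmp/files.before /tmp/files.now > /tmp/files.lost
~ $ wc -l < /tmp/files.lost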

Version-Release number of selected component (if applicable):
6.8

How reproducible:
Unable to reproduce on our staging environment, but it reproduces on
production

Steps to Reproduce:
1. create a distributed-replicated volume with replication factor 2
2. mount the volume and copy files to it
3. add new servers to the pool: gluster peer probe ..
4. expand the volume: gluster volume add-brick my-vol srv6:/br srv7:/br
5. start the rebalance: gluster volume rebalance my-vol start
6. check that all files still exist (a consolidated sketch follows below)
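A consolidated sketch of the steps above; server names and brick paths are
the placeholders from the steps, and with replica 2 bricks must be added in
pairs:

~ $ sudo gluster peer probe srv6
~ $ sudo gluster peer probe srv7
# one replica pair at a time (replica 2 => bricks in multiples of 2)
~ $ sudo gluster volume add-brick my-vol srv6:/br srv7:/br
~ $ sudo gluster volume rebalance my-vol start
# wait until status reports "completed"
~ $ sudo gluster volume rebalance my-vol status
# then compare against the pre-rebalance file count on the client mount
~ $ find /mnt/my-vol -type f | wc -l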

Actual results:
some files disappeared

Expected results:
all files exist

Additional info:

~ $ sudo gluster volume info

Volume Name: vol1
Type: Distributed-Replicate
Volume ID: fb35a90e-5174-466e-bb66-39391e8e83b9
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: srv1:/vol3/brick
Brick2: srv2:/vol3/brick
Brick3: srv3:/vol3/brick
Brick4: srv4:/vol3/brick
Brick5: srv5:/vol3/brick
Brick6: srv6:/vol3/brick
Brick7: srv7:/var/lib/gluster-bricks/brick
Brick8: srv8:/var/lib/gluster-bricks/brick
Brick9: srv9:/var/lib/gluster-bricks/brick
Brick10: srv10:/var/lib/gluster-bricks/brick
Brick11: srv11:/var/lib/gluster-bricks/brick
Brick12: srv12:/var/lib/gluster-bricks/brick
Options Reconfigured:
cluster.self-heal-daemon: enable
cluster.rebal-throttle: normal
performance.readdir-ahead: off
transport.address-family: inet6
performance.io-thread-count: 64
nfs.disable: on
performance.io-cache: on
performance.quick-read: off
performance.parallel-readdir: on
performance.client-io-threads: off
features.sdfs: enable
performance.read-ahead: off
client.event-threads: 4
server.event-threads: 32
