[Gluster-users] Gluster 3.6.9 missing files during remove migration operations

Wed Apr 20 23:55:40 UTC 2016

Hi,

I'm running Gluster 3.6.9 on Ubuntu 14.04 on a single test server (under Vagrant and VirtualBox) with 4 filesystems in addition to the root: 2 are xfs directly on the disk, and the other 2 are xfs on LVM. The scenario I'm testing is the migration of our production gluster onto LVM-backed bricks, so that we can use the snapshot features in 3.6 to implement offline backups.

On my test machine, I configured a volume with replica 2 and 2 bricks (with both bricks on the same server). I then started the volume, mounted it back onto the same server under /mnt, and populated /mnt with a 3-level-deep hierarchy of 16 directories per level, adding 10 files of 1kB to each leaf directory. So there are 40960 files in the filesystem (16x16x16x10), named like a/b/c/abc.0
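For anyone who wants to reproduce the layout, here is a rough Python sketch of how such a tree could be generated. The directory names (the letters a through p), the file contents, and the function name populate are my assumptions; only the 16x16x16x10 shape, the 1kB size and the a/b/c/abc.0 naming pattern come from the description above.

```python
import itertools
import os
import tempfile


def populate(root, names="abcdefghijklmnop", files_per_leaf=10, size=1024):
    """Create a 3-level directory tree under root and fill each leaf with files.

    names: one character per directory at each level (assumed: a..p, 16 names).
    Returns the number of files written.
    """
    payload = b"x" * size  # arbitrary filler, assumed; any 1kB content would do
    count = 0
    for a, b, c in itertools.product(names, repeat=3):
        leaf = os.path.join(root, a, b, c)
        os.makedirs(leaf, exist_ok=True)
        for i in range(files_per_leaf):
            # File name follows the a/b/c/abc.0 pattern from the post.
            with open(os.path.join(leaf, f"{a}{b}{c}.{i}"), "wb") as fh:
                fh.write(payload)
            count += 1
    return count


if __name__ == "__main__":
    # Small dry run in a temporary directory; point root at /mnt for the real test.
    with tempfile.TemporaryDirectory() as tmp:
        print(populate(tmp, names="ab", files_per_leaf=2))
```

With the default arguments and root set to the mounted volume, this writes 16x16x16x10 = 40960 files.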

For my first test, I did a "replace-brick commit force" to swap the first brick in my config with a new brick on one of the xfs-on-LVM filesystems, which appeared to be the suggestion in https://www.gluster.org/pipermail/gluster-users/2012-October/011502.html for a replicated volume. This resulted in the /mnt filesystem appearing empty until I manually started a full heal on the volume, after which the files and directories started to reappear on the mounted filesystem. After the heal completed, everything looked OK, but that's not going to work for our production systems.
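To make the first test concrete, the sequence was roughly the following, sketched here as a Python helper that only builds and prints the commands rather than running them. The volume name and brick paths are hypothetical placeholders, and the final "heal info" step is an assumption about how heal progress could be checked, not something the test above recorded.

```python
import shlex


def replace_brick_commands(volume, old_brick, new_brick):
    """Return the gluster CLI invocations for a forced brick swap plus full heal."""
    return [
        # Swap the old brick for the new, empty one without migrating data.
        ["gluster", "volume", "replace-brick", volume, old_brick, new_brick,
         "commit", "force"],
        # Trigger a full self-heal so the replica repopulates the new brick.
        ["gluster", "volume", "heal", volume, "full"],
        # Assumed follow-up: list entries still pending heal.
        ["gluster", "volume", "heal", volume, "info"],
    ]


if __name__ == "__main__":
    # Hypothetical names; substitute the real volume and brick paths.
    cmds = replace_brick_commands(
        "testvol", "server1:/bricks/xfs1/brick", "server1:/bricks/lvm1/brick"
    )
    for argv in cmds:
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing instead of executing keeps the sketch safe to run anywhere; passing each list to subprocess.run would execute it on a gluster node.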

For my second attempt, I rebuilt the test system from scratch, built and mounted the gluster volume the same way, and populated it with the same test file configuration. I then did a volume add-brick and added both of the xfs-on-LVM filesystems to the configuration. The directory tree was copied to the new bricks, but no files were moved. I then did volume remove-brick on the 2 initial bricks and the system started migrating the files to the new filesystems. This looked more promising, but during the migration operation I ran find /mnt -type f | wc -l a number of times, and on one of those checks the number of files was 39280 instead of 40960 (1680 fewer). I wasn't able to observe exactly which files were missing: I ran the command again immediately and it reported 40960 files, as it did on every other check during the migration.
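Since a bare count cannot say which files disappeared, one option for a repeat run is to record the full path list before starting remove-brick and diff against it on every pass. The sketch below is one way that could look; the function names, the polling interval and the number of rounds are all assumptions, and it approximates find -type f by counting regular files that are not symlinks.

```python
import os
import time


def list_files(root):
    """Return the set of regular-file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Mirror `find -type f`: regular files only, no symlinks.
            if os.path.isfile(full) and not os.path.islink(full):
                found.add(os.path.relpath(full, root))
    return found


def watch(root, baseline, rounds=10, delay=1.0):
    """Scan root repeatedly; return (round, missing paths) for each short scan."""
    results = []
    for n in range(rounds):
        missing = baseline - list_files(root)
        if missing:
            results.append((n, sorted(missing)))
        time.sleep(delay)
    return results


if __name__ == "__main__":
    # Take the baseline before remove-brick starts, then poll during migration.
    baseline = list_files("/mnt")
    for round_no, paths in watch("/mnt", baseline, rounds=60, delay=5.0):
        print(round_no, len(paths), paths[:5])
```

Any round that comes up short then reports the exact paths, which would show whether the missing files cluster in particular directories.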

Is this expected behavior, or have I stumbled on a bug?

Is there a better workflow for completing this migration?

The production system runs in AWS and has 6 gluster servers over 2 availability zones. Each server has 1x600GB brick on an EBS volume, and the bricks are configured into a single 1.8TB volume with replication across the availability zones. We are planning on creating the new volumes with about 10% headroom left in the LVM config for holding snapshots, and hoping we can implement a backup solution by doing a gluster snapshot, followed by an EBS snapshot, to get a consistent point-in-time offline backup (and then delete the gluster snapshot once the EBS snapshot has been taken). I haven't yet figured out the details of how we would restore from the snapshots (I can test that scenario once I have a working local test migration procedure and can migrate our test environment in AWS to support snapshots).
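The backup sequence we have in mind would look something like the sketch below, again only building the commands rather than running them. The volume name, snapshot name and EBS volume IDs are hypothetical, and using --mode=script to skip the delete confirmation prompt is an assumption about how we would script it. Whether the gluster snapshot can be deleted as soon as create-snapshot returns, or only once the EBS snapshot has completed, is left open here.

```python
def backup_commands(gluster_volume, snap_name, ebs_volume_ids):
    """Return the planned command sequence: gluster snapshot, EBS snapshots, cleanup."""
    description = f"{gluster_volume} {snap_name}"
    commands = [
        # 1. Take the gluster-level snapshot (needs LVM-backed bricks).
        ["gluster", "snapshot", "create", snap_name, gluster_volume],
    ]
    # 2. Snapshot each EBS volume that backs a brick.
    for volume_id in ebs_volume_ids:
        commands.append(
            ["aws", "ec2", "create-snapshot",
             "--volume-id", volume_id,
             "--description", description]
        )
    # 3. Remove the gluster snapshot to free the LVM headroom again.
    commands.append(["gluster", "--mode=script", "snapshot", "delete", snap_name])
    return commands


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    for argv in backup_commands("prodvol", "backup1", ["vol-aaaa1111", "vol-bbbb2222"]):
        print(" ".join(argv))
```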

Thanks,
Bernard.
