[Gluster-users] Disastrous performance with rsync to mounted Gluster volume.
Ernie Dunbar
maillist at lightspeed.ca
Thu Apr 23 17:54:35 UTC 2015
Hello everyone.
I've built a replicated Gluster cluster (volume info below) of two
Dell servers on a 1 Gbps switch, with a second NIC on each server
dedicated to replication traffic. But when I try to copy our mail store
from our backup server onto the Gluster volume, I've had nothing but
trouble.
I may have messed this up right from the start: I used rsync to copy
all the files directly to the Linux filesystem on the primary Gluster
server's brick, instead of copying them to an NFS or Gluster mount.
Getting Gluster to synchronize those files to the second Gluster server
hasn't worked out well at all; only about half the data has actually
been copied over. Attempts to force Gluster to synchronize the rest
have all failed (Gluster appears to think the data is already
synchronized). This might still turn out to be the best way of
accomplishing the migration, but in the meantime I've tried a
different tack.
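For reference, the recovery commands I've been attempting look roughly like this (a sketch; exact syntax may vary by GlusterFS version):

```shell
# Trigger a full self-heal so Gluster re-examines every file on the
# volume, not just the ones it already knows are out of sync:
gluster volume heal gv2 full

# Check how many entries each brick still reports as needing healing:
gluster volume heal gv2 info
```

Even after a full heal, the second brick still appears to be missing about half the files.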
Now I'm trying to mount the Gluster volume over the network from the
backup server using NFS. (The backup server can't run a compatible
version of GlusterFS; I plan to nuke it and install an OS that does
support it, but first we have to get this mail store copied over!)
Then I use rsync to copy only the missing files to the NFS share and
let Gluster handle its own replication. This has been many, many times
slower than just using rsync to copy the files directly, even
considering the amount of data (439 GB). CPU usage on the Gluster
servers is fairly high, with a load average of about 4 on an 8-CPU
system. Network usage is... well, not that high, topping out at maybe
50-70 Mbps. That's true whether I look at the primary, server-facing
network or the secondary, Gluster-only network, so I don't think the
bottleneck is there. Hard drive utilization peaks at around 40% but
doesn't stay that high.
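The mount and copy commands I'm running look roughly like this (hostnames and paths are illustrative). I've also read that --inplace and --whole-file can help on Gluster by avoiding rsync's write-to-temp-then-rename pattern, though I can't vouch for how much they help:

```shell
# Gluster's built-in NFS server speaks NFSv3 only, so force vers=3:
mount -t nfs -o vers=3,tcp nfs1:/gv2 /mnt/gv2

# Copy only files missing on the destination; --inplace and
# --whole-file avoid rsync's temp-file-and-rename behaviour, which
# is reportedly expensive on Gluster volumes:
rsync -av --ignore-existing --inplace --whole-file \
    /backup/mailstore/ /mnt/gv2/mailstore/
```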
One possible clue may lie in Gluster's logs. I see millions of log
entries like this:
[2015-04-23 16:40:50.122007] I [afr-self-heal-entry.c:1909:afr_sh_entry_common_lookup_done] 0-gv2-replicate-0: <gfid:912eec51-89dc-40ea-9dfd-072404d306a2>/1355401127.H542717P24276.pop.lightspeed.ca:2,: Skipping entry self-heal because of gfid absence
[2015-04-23 16:40:50.123327] I [afr-self-heal-entry.c:1909:afr_sh_entry_common_lookup_done] 0-gv2-replicate-0: <gfid:912eec51-89dc-40ea-9dfd-072404d306a2>/1355413874.H20794P22730.pop.lightspeed.ca:2,: Skipping entry self-heal because of gfid absence
[2015-04-23 16:40:50.123705] I [afr-self-heal-entry.c:1909:afr_sh_entry_common_lookup_done] 0-gv2-replicate-0: <gfid:912eec51-89dc-40ea-9dfd-072404d306a2>/1355420013.H176322P3859.pop.lightspeed.ca:2,: Skipping entry self-heal because of gfid absence
[2015-04-23 16:40:50.124030] I [afr-self-heal-entry.c:1909:afr_sh_entry_common_lookup_done] 0-gv2-replicate-0: <gfid:912eec51-89dc-40ea-9dfd-072404d306a2>/1355429494.H263072P14676.pop.lightspeed.ca:2,: Skipping entry self-heal because of gfid absence
[2015-04-23 16:40:50.124423] I [afr-self-heal-entry.c:1909:afr_sh_entry_common_lookup_done] 0-gv2-replicate-0: <gfid:912eec51-89dc-40ea-9dfd-072404d306a2>/1355436426.H973617P29804.pop.lightspeed.ca:2,: Skipping entry self-heal because of gfid absence
These logs grow so fast that I have to truncate them every hour, or
the /var partition fills up within a couple of days.
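As a stopgap, I'm considering raising the log threshold so these INFO-level self-heal messages stop flooding /var (a sketch; option names taken from the diagnostics settings in the docs, so please correct me if they're wrong for my version):

```shell
# Suppress INFO-level messages like the ones above;
# WARNING and above are still logged:
gluster volume set gv2 diagnostics.client-log-level WARNING
gluster volume set gv2 diagnostics.brick-log-level WARNING
```

That would hide the symptom, though, not explain why self-heal keeps skipping these entries.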
And finally, I have the gluster volume info:
root@nfs1:/brick1/gv2/www3# gluster vol info gv2
Volume Name: gv2
Type: Replicate
Volume ID: fb06a044-7871-4362-b134-fb97433f89f7
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: nfs1:/brick1/gv2
Brick2: nfs2:/brick1/gv2
Options Reconfigured:
nfs.disable: off
Any help removing myself from this mess would be greatly appreciated. :)