[Gluster-users] Gluster volume heal statistics aren't changing.
Ernie Dunbar
maillist at lightspeed.ca
Thu Apr 14 21:51:34 UTC 2016
Hi everyone.
So, a few days ago, I added another Gluster server to our cluster to
prevent split-brains. I told the new server to do a self-heal operation,
then sat back and waited while the cluster's performance dropped
dramatically and our customers lost patience with us over the course of
several days.
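For reference, I believe I kicked the heal off with the standard command
(I may have the exact invocation slightly off):

gluster volume heal gv2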
Now I see that the disk on the new node has filled somewhat, but
apparently the self-heal process has stalled. This is what I see when I
run the "volume heal statistics heal-count" command:
root at nfs3:/home/ernied# date
Thu Apr 14 13:14:00 PDT 2016
root at nfs3:/home/ernied# gluster volume heal gv2 statistics heal-count
Gathering count of entries to be healed on volume gv2 has been successful
Brick nfs1:/brick1/gv2
Number of entries: 475
Brick nfs2:/brick1/gv2
Number of entries: 190
Brick nfs3:/brick1/gv2
Number of entries: 36
root at nfs3:/home/ernied# date
Thu Apr 14 14:35:00 PDT 2016
root at nfs3:/home/ernied# gluster volume heal gv2 statistics heal-count
Gathering count of entries to be healed on volume gv2 has been successful
Brick nfs1:/brick1/gv2
Number of entries: 475
Brick nfs2:/brick1/gv2
Number of entries: 190
Brick nfs3:/brick1/gv2
Number of entries: 36
After an hour and 20 minutes, I see zero progress. How do I give this
thing a kick in the pants to get moving?
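The only candidate I've found in the docs so far is a full heal. I
assume something like the commands below would check that the self-heal
daemon is actually running, force a full re-crawl, and let me watch
progress, but I don't know whether that would just make the performance
problem worse:

gluster volume status gv2       # confirm the Self-heal Daemon is online on all three nodes
gluster volume heal gv2 full    # force a full crawl instead of relying on the index heal
gluster volume heal gv2 info    # list the entries still pending heal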
Also, after reading a bit about Gluster tuning, I suspect I may have
made a mistake when creating the bricks. I've read that we should have
pairs of bricks for faster access, but we've only got one brick
replicated across 3 servers (or maybe that's 3 bricks that all happen to
share the same path; I'm not sure). Here's what the "volume info"
command shows:
root at nfs1:/home/ernied# gluster volume info
Volume Name: gv2
Type: Replicate
Volume ID: 3969e9cc-a2bf-4819-8c02-bf51ec0c905f
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: nfs1:/brick1/gv2
Brick2: nfs2:/brick1/gv2
Brick3: nfs3:/brick1/gv2
Options Reconfigured:
cluster.server-quorum-type: none
cluster.server-quorum-ratio: 51
We currently have about 618 GB of data shared across three 6 TB RAID
arrays. The data is nearly all e-mail, so it's a lot of small files, and
IMAP does a lot of random read/write operations. Customers are not
pleased with the speed of our webmail right now. Would creating a larger
number of smaller bricks speed up our backend performance? Is there a
way to do that non-destructively?
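For what it's worth, this is roughly what I imagine converting to a
distributed-replicated layout would look like. The /brick2 paths are
just placeholders, and I have no idea whether this is safe to do on a
live volume:

gluster volume add-brick gv2 replica 3 nfs1:/brick2/gv2 nfs2:/brick2/gv2 nfs3:/brick2/gv2   # placeholder brick paths; keeps replica count at 3
gluster volume rebalance gv2 start    # spread existing data across the new bricks
gluster volume rebalance gv2 status   # watch rebalance progress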