<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>We currently have a Gluster array of three bare-metal servers in a
Replicate 1x3 configuration. This single brick has about 1.1 TB of
data and is configured for 3.7 TB of total space. This array is
mostly hosting mail in Maildir format, although we'd like it to
also host some Proxmox VMs. The problem with doing that is that
the Gluster array is so slow that booting VMs
from Gluster makes Proxmox time out! We've instead started
experimenting with using Gluster's NFS server to host the VMs,
which is much faster, but there are obvious issues with stability.
We're not really hosting anything important yet; this is still an
experiment. Except for all our mail, of course.<br>
</p>
<p>The e-mail performance isn't spectacularly fast, but it's mostly
bearable at the moment. <br>
</p>
<p>The real meat of this post, however, is "What do we do about
this?" I figured that I had built a slow RAID configuration (disk
utilization was very high), so I took down one of the Gluster
nodes and rebuilt it as a RAID 0 array. This meant starting again
with a completely empty disk. Once I had rebuilt the node and
started the volume heal, the heal absolutely slaughtered performance.
Our mail server became so slow that webmail was unusable.
Healing the volume takes <b>days</b> to move 1.1 TB
of data, and we couldn't just let it run with performance that bad,
so I stopped the Gluster daemon during the day and only ran it at
night. It took two whole weeks to completely heal the volume in
this fashion, even with the heal left running over the weekend
for two days straight. <br>
</p>
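<p>To put a rough number on that, here is a back-of-the-envelope
sketch of the average heal throughput implied by the figures above.
It assumes decimal units (1 TB = 10**12 bytes), and the one-third
"duty cycle" for the nightly runs is a guess for illustration only,
since I did not record the exact hours the heal was running.</p>
<pre>
```python
# Back-of-the-envelope heal throughput, using only the figures quoted above.
# Assumption: decimal units (1 TB = 10**12 bytes).

DATA_BYTES = 1.1e12          # about 1.1 TB of data to heal
WALL_CLOCK_DAYS = 14         # "two whole weeks" from start to finish
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_mb_per_s(data_bytes, days):
    """Average transfer rate in MB/s if the data moved evenly over `days`."""
    return data_bytes / (days * SECONDS_PER_DAY) / 1e6


wall_clock_rate = average_rate_mb_per_s(DATA_BYTES, WALL_CLOCK_DAYS)
print(f"Average over the two weeks: {wall_clock_rate:.2f} MB/s")

# Hypothetical: if the heal was actually running only a third of that time
# (nights plus one weekend; the exact hours were not recorded), the rate
# while running would be three times higher.
ACTIVE_FRACTION = 1 / 3      # assumed, not measured
active_rate = wall_clock_rate / ACTIVE_FRACTION
print(f"While running (assumed 1/3 duty cycle): {active_rate:.2f} MB/s")
```
</pre>
<p>Averaged over the full two weeks that works out to roughly
0.9 MB/s, and even under the assumed duty cycle it stays in the low
single digits of MB/s, far below what the disks or the network should
be able to sustain.</p>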
<p>So what happens when we add more Gluster nodes to this array? Or
if we want to upgrade the hardware in the array in any way? Or
if I want to make any other changes to the array? It seems that
Gluster's promise of high availability amounts to "things will keep
working, but they'll be so slow in the meantime that nobody <b>wants</b>
to use the services built on top of it". The same is true whenever
a node has to be taken offline for an extended period of time and
the array has to be healed again. <br>
</p>
<p>This is a serious issue with the performance of heal operations.
What can I do to fix it?<br>
</p>
</body>
</html>