[Gluster-users] GlusterFS as virtual machine storage

Wed Aug 23 18:50:40 UTC 2017

On 8/21/2017 1:09 PM, Gionatan Danti wrote:
>
> Hi all,
> I would like to ask if, and with how much success, you are using 
> GlusterFS for virtual machine storage.
>
> My plan: I want to setup a 2-node cluster, where VM runs on the nodes 
> themselves and can be live-migrated on demand.
>
> I have some questions:
> - do you use GlusterFS for similar setup?

Yes: 2 nodes plus an arbiter. You NEED the arbiter or a third node. Do 
NOT try plain 2-node replication for VM storage.

We also use sharding to speed up heals.

A full 3-node replica would be even better, but 2 nodes + arbiter is 
faster.
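
To illustrate why the third vote matters, here is a toy majority check 
in Python. It is only a sketch of the general quorum idea, not Gluster's 
actual quorum code, and the function name is made up for the example.

```python
def has_quorum(votes_reachable: int, total_votes: int) -> bool:
    """Return True if this side of a partition sees a strict majority."""
    return votes_reachable * 2 > total_votes


# 2 nodes, link between them breaks: each side sees only itself.
# Neither side has a majority, so you either stop writes or risk
# both sides writing independently (split-brain).
print("2 nodes, 1 reachable:", has_quorum(1, 2))

# 2 data nodes + arbiter, one data node goes down: the surviving
# data node plus the arbiter still form a majority and can keep going.
print("2 nodes + arbiter, 2 reachable:", has_quorum(2, 3))

# The isolated node on the other side knows it is in the minority.
print("2 nodes + arbiter, 1 reachable:", has_quorum(1, 3))
```

With only two votes there is no way to tell "the other node died" from 
"I am the one who got cut off", which is the situation the arbiter 
resolves.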

Note the arbiter doesn't have to be great kit. It's mostly memory and a 
small amount of hard disk space, and even on older systems you can throw 
in a cheap low-capacity SSD. You probably have a bunch of those lying 
around.

You could probably get away with using containers for the arbiter and 
'share' an arbiter "host" among clusters. We haven't had a chance to try 
that yet, but with net=host and a unique IP per container, I don't see 
why it would be an issue.

> - if so, how do you feel about it?

Very happy. Reasonable and reliable performance (compared to other 
distributed storage). Gluster does not have the performance of a 
direct-attached SSD, but none of the distributed storage options can 
match that unless they cheat with heavy buffering and async writes, 
which is problematic for VM files if something bad happens.

> - if a node crashes/reboots, how the system re-syncs? Will the VM 
> files be fully resynchronized, or the live node keeps some sort of 
> write bitmap to resynchronize changed/written chunks only? (note: I 
> know about sharding, but I would like to avoid it);

Without sharding, any reheal after an outage (planned or otherwise) will 
take a LOT longer, because you have to sync the entire VM file, which in 
our case is 20GB to 150GB per affected VM. That can take quite a while 
even with a fast network.

With sharding, in many cases the reheal after maintenance amounts to a 
'pause' and is almost a non-event, because it only has to heal the few 
shards that are out of sync.
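
A back-of-the-envelope comparison of the two cases, as a small Python 
sketch. The 150GB figure is from above; the 64 MiB shard size, the 
count of 40 dirty shards and the 1 Gbit/s link are assumptions picked 
purely for illustration, and protocol overhead and disk speed are 
ignored.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def transfer_seconds(num_bytes: int, link_bits_per_sec: float) -> float:
    """Idealised time to push num_bytes over a link (no overhead)."""
    return num_bytes * 8 / link_bits_per_sec


vm_size = 150 * GIB      # largest VM image size mentioned above
shard_size = 64 * MIB    # assumption: shard block size
dirty_shards = 40        # assumption: shards written during the outage
link = 1e9               # assumption: 1 Gbit/s replication link

# Without sharding the whole image has to be synced.
full = transfer_seconds(vm_size, link)

# With sharding only the shards that changed have to be synced.
sharded = transfer_seconds(dirty_shards * shard_size, link)

print(f"full-file heal: ~{full / 60:.0f} min")
print(f"sharded heal:   ~{sharded:.0f} s")
```

The exact numbers will differ on real hardware, but the ratio is the 
point: the work scales with what changed, not with the size of the 
image.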

The cool thing about gluster in old-school replication mode is that the 
VM files are all there on each node. There is no master index that can 
get corrupted with your bits spread out among the various nodes.

Of course, with sharding you would have to re-assemble the file, but 
that has been discussed on this list, and we have tested it several 
times, even on large VMs, by removing a brick and having a tech 
re-assemble the file and check the md5sum to make sure we have a working 
VM file.
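
A rough Python sketch of what such a re-assembly could look like. This 
is not the procedure from the list, just an illustration under some 
assumptions: the first block sits at the file's normal path on the 
brick, the remaining pieces sit in one directory (typically .shard on 
the brick) named <gfid>.<index>, and you know the shard block size the 
volume used. The function names are made up. On a distributed volume 
the shards may be spread over several bricks, so you would have to 
collect them into one directory first.

```python
import hashlib
import os
import re
import shutil


def reassemble(base_path, shard_dir, gfid, shard_size, out_path):
    """Rebuild a sharded file from its base block plus numbered shards.

    Each shard <gfid>.<n> is written at offset n * shard_size, so shards
    can be processed in any order and missing shards become zero-filled
    holes (as in a sparse image).
    """
    pattern = re.compile(re.escape(gfid) + r"\.(\d+)")
    with open(out_path, "wb") as out:
        # Block 0 is the file at its normal path on the brick.
        with open(base_path, "rb") as base:
            shutil.copyfileobj(base, out)
        for name in os.listdir(shard_dir):
            match = pattern.fullmatch(name)
            if match is None:
                continue  # shard belonging to some other file
            index = int(match.group(1))
            out.seek(index * shard_size)
            with open(os.path.join(shard_dir, name), "rb") as shard:
                shutil.copyfileobj(shard, out)


def md5sum(path, chunk_size=1024 * 1024):
    """MD5 of a file, read in chunks so large images fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical usage (paths, gfid and shard size are placeholders):
# reassemble("/bricks/b1/vm.img", "/bricks/b1/.shard",
#            "<gfid-of-vm.img>", 64 * 1024 * 1024, "/tmp/vm.img")
# print(md5sum("/tmp/vm.img"))
```

Compare the resulting md5sum against one taken from the same file 
through a healthy mount, ideally with the VM shut down so the image is 
not changing underneath you.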

>
> - finally, how much stable is the system?

We were on 3.4 for years on some old clusters and never had a serious 
problem, but we had to be really careful during upgrades/reboots because 
they were 2-node systems, and if you didn't do things precisely you 
ended up with a split-brain. On the rare crash event, we would often 
pick a good image from one of the nodes and designate it as the source.

We are on 3.10 now and the arbiter+sharding go a long way in solving 
that issue.

-wk