[Gluster-users] How to correctly distribute OpenStack VM files...

Tushar Katarki tkatarki at redhat.com
Mon Aug 5 13:32:17 UTC 2013


----- Original Message -----

> From: "Gowrishankar Rajaiyan" <gsr at redhat.com>
> To: "Xavier Trilla" <xavier.trilla at silicontower.net>
> Cc: gluster-users at gluster.org
> Sent: Monday, August 5, 2013 9:01:57 AM
> Subject: Re: [Gluster-users] How to correctly distribute OpenStack VM files...

> On 08/02/2013 06:22 AM, Xavier Trilla wrote:

> > Hi,
> 

> > We have been playing with GlusterFS for a while (now with version 3.4),
> > running tests to check whether GlusterFS can really be used as the
> > distributed storage for OpenStack Block Storage (Cinder), since new
> > features in KVM, GlusterFS and OpenStack are pointing to GlusterFS as the
> > future of OpenStack open source block and object storage.
> 

> > But we found a problem as soon as we started playing with GlusterFS: the
> > way the distribute translator (DHT) balances the load. We understand and
> > see the benefits of a metadata-less setup, and hashing filenames and
> > assigning a hash range to each brick is clever, reliable and fast. But as
> > far as we can tell, there is a big problem when it comes to storing the VM
> > images of an OpenStack deployment.
> 

> > OpenStack Block Storage (Cinder) assigns a name (a GUID) to each volume it
> > creates, so GlusterFS hashes that filename and decides on which brick the
> > file should be stored. But because in this scenario we don't have many
> > files (just one big file per VM), we may end up with really unbalanced
> > storage.
> 
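To make the hashing behaviour concrete, here is a rough sketch in Python. It
is not GlusterFS's actual hash function or layout logic; md5 and an evenly
split 32-bit range are stand-ins, and the "volume-<GUID>" names just mimic
what Cinder produces:

import hashlib
import uuid
from collections import Counter

BRICKS = 4
HASH_SPACE = 2**32  # pretend 32-bit hash space, split evenly across bricks

def brick_for(filename):
    # Hash the filename into [0, HASH_SPACE) and pick the brick whose
    # contiguous hash range contains that value.
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16) % HASH_SPACE
    return h // (HASH_SPACE // BRICKS)

# 100 Cinder-style volume names, as in the scenario described above.
names = ["volume-%s" % uuid.uuid4() for _ in range(100)]
counts = Counter(brick_for(name) for name in names)

for brick in range(BRICKS):
    print("Brick%d: %d VMs" % (brick + 1, counts[brick]))

Running this a few times gives splits such as 31/22/27/20 rather than a neat
25/25/25/25, which is the effect being described below.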

> > Let's say we have a 4-brick setup with DHT distribute and we want to store
> > 100 VMs there. The ideal scenario would be:
> 

> > Brick1: 25 VMs
> > Brick2: 25 VMs
> > Brick3: 25 VMs
> > Brick4: 25 VMs
> 

> > As VMs are I/O intensive, it's really important to balance the load
> > correctly, since each brick has a limited amount of IOPS. But because DHT
> > is based purely on a filename hash, we could end up with something like the
> > following scenario (or even worse):
> 

> > Brick1: 50 VMs
> > Brick2: 10 VMs
> > Brick3: 35 VMs
> > Brick4: 5 VMs
> 

> > And if we scale this out, things may get even worse: we may end up with
> > almost all the VM files on one or two bricks and the other bricks almost
> > empty. And if we use growing VM disk image files like qcow2, the
> > "min-free-disk" option will not prevent all the VM disk image files from
> > being stored on the same brick. So I understand DHT works well for large
> > numbers of small files, but for a few big, I/O-intensive files it doesn't
> > seem to be a really good solution... (We are looking for a solution able to
> > handle around 32 bricks and 1500 VMs for the initial deployment, and to
> > scale up to 256 bricks and 12000 VMs :/ )
> 
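The min-free-disk point can be illustrated with a toy simulation. The
assumptions here are that the free-space check only happens when a file is
created, and that thin-provisioned qcow2 images are nearly empty at creation
time and grow later; all capacities, thresholds and sizes are made-up numbers:

import random

BRICK_CAPACITY_GB = 1000
MIN_FREE_GB = 100            # a min-free-disk style threshold (made up)
used_gb = [0.0] * 4          # space currently used on each brick

def passes_min_free_check(brick):
    # The check can only see what is used at creation time.
    return BRICK_CAPACITY_GB - used_gb[brick] > MIN_FREE_GB

random.seed(42)
placements = []
for _ in range(100):
    brick = random.randrange(4)          # hash placement approximated as random
    assert passes_min_free_check(brick)  # thin images are tiny, so this always passes
    used_gb[brick] += 0.001              # the qcow2 file is nearly empty when created
    placements.append(brick)

# Later, each image grows to its real size on whichever brick it landed on.
for brick in placements:
    used_gb[brick] += random.uniform(10, 30)

for i, used in enumerate(used_gb):
    print("Brick%d: %.0f GB used of %d GB" % (i + 1, used, BRICK_CAPACITY_GB))

The creation-time check passes every time because the images are tiny when
they are created; the imbalance only shows up once they have grown.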

> > So, does anybody have a suggestion about how to handle this? So far we only
> > see two options: either use the legacy unify translator with the ALU
> > scheduler, or use the cluster/stripe translator with a big block-size so
> > that at least the load gets balanced across all bricks in some way. But
> > obviously we don't like unify as it needs a namespace brick, and striping
> > seems to have an impact on performance and really complicates
> > backup/restore/recovery strategies.
> 
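For what it's worth, the reason striping balances the load even with very few
files is that each large image is cut into fixed-size blocks laid out across
the bricks round-robin. A rough sketch of that layout (the 128 MB block size
and 40 GB image are just example numbers, not recommendations):

from collections import Counter

BLOCK_SIZE_MB = 128
N_BRICKS = 4
IMAGE_SIZE_MB = 40 * 1024    # one 40 GB VM image

# Round-robin placement: block i of the file lives on brick i % N_BRICKS,
# so even a single hot VM's I/O is spread over every brick.
n_blocks = IMAGE_SIZE_MB // BLOCK_SIZE_MB
blocks_per_brick = Counter(block % N_BRICKS for block in range(n_blocks))

for brick in range(N_BRICKS):
    gb = blocks_per_brick[brick] * BLOCK_SIZE_MB / 1024.0
    print("Brick%d: %d blocks (%.1f GB of this image)"
          % (brick + 1, blocks_per_brick[brick], gb))

The flip side, as noted above, is that every brick then holds a piece of every
image, which is what complicates backup and recovery.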

> Another suggestion you may want to try: have your GlusterFS nodes also serve
> as the OpenStack Cinder nodes and use NUFA [1].

> ~shanks

> [1]
> http://gluster.org/community/documentation/index.php/Translators/cluster/nufa
Maybe. But before we explore solutions and alternatives, the problem needs
further characterization. It is not clear to me whether Xavier has actually
seen this problem at 100 VMs, or whether it is an extrapolation from a problem
he is seeing with a much smaller number of VMs.
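One quick way to characterize it on an existing test setup is to count the
volume files sitting on each brick's backend directory. A small read-only
script along these lines would do it; the brick paths below are placeholders
for your actual brick directories:

import os

# Placeholder paths: substitute the backend directories of your own bricks.
BRICK_DIRS = [
    "/export/brick1/cinder-volumes",
    "/export/brick2/cinder-volumes",
    "/export/brick3/cinder-volumes",
    "/export/brick4/cinder-volumes",
]

for path in BRICK_DIRS:
    try:
        # Cinder-created images are typically named "volume-<uuid>".
        files = [f for f in os.listdir(path) if f.startswith("volume-")]
    except OSError:
        files = []
    print("%s: %d volume files" % (path, len(files)))

With numbers like that in hand, it would be much easier to tell whether the
imbalance is real at the scale Xavier cares about.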