[Gluster-users] GlusterFS 3.1 on Amazon EC2 Challenge

Barry Jaspan barry.jaspan at acquia.com
Tue Oct 26 13:11:03 UTC 2010


You keep missing my central point. You wrote: "As the fullest EBS device
gets to 80%, **using snapshot/restore techniques**, replace them with 250 gb
EBS devices."

No.

My experience says: Do not use snapshot/restore techniques to grow a
glusterfs brick on EBS. Don't use rsync either, not even rsync 3 with -X. In principle
these should work fine but we've found that rsync takes too long to be
useful (at least on xfs with all the extended attributes) and an EBS volume
reconstituted from a snapshot is so slow/bursty while it pages in that
glusterfs becomes unreliable. Maybe if you don't care if your clients and
servers hang for a while during the transition, the snapshot is fine... but
if you don't care about keeping your production environment up or
performant, all sorts of options are available.
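
(For reference, the kind of copy we mean is a server-side, brick-to-brick
sync that preserves gluster's extended attributes, roughly the following --
paths are hypothetical, and -X needs rsync 3.0 or later:

    rsync -aX /old-brick/ /new-brick/

It is precisely that per-file xattr handling on xfs that makes it too slow
for us.)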

To grow the bricks in a replicated volume, replace one of the bricks with a
larger EMPTY VOLUME and do a recursive scan of /mnt/gfs; the scan makes the
replicate translator self-heal every file from the surviving brick onto the
new, empty brick. When that's done, replace the other brick with a larger
EMPTY VOLUME and do another recursive scan. This allows you to grow the
volume without downtime, though it means
you are not HA during the transition. We've discussed using 3-way
replication to be able to do this process while remaining HA, but frankly
2-way replication already imposes more performance overhead than we can
really accept.
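
A rough sketch of one round of that swap, with hypothetical names
(/dev/xvdh for the new EBS device, /ebs/brick1 for the brick export,
/mnt/gfs for the client mount); the exact detach/re-attach step depends on
whether you manage the volume with the 3.1 CLI or with hand-written
volfiles:

    # On the server: format the new, larger EBS volume and mount it where
    # the old brick lived, so the export path stays the same.
    mkfs.xfs /dev/xvdh
    umount /ebs/brick1
    mount /dev/xvdh /ebs/brick1
    # ...re-attach/restart the brick here, per your volume management...

    # On a client: walk the whole mount so the replicate translator
    # self-heals every file from the surviving brick onto the empty one.
    find /mnt/gfs -noleaf -print0 | xargs --null stat > /dev/null 2>&1

    # When the scan finishes, repeat the whole procedure for the other
    # brick.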

As for everything else you are doing, I haven't tried it yet so I have no
relevant experience.

Thanks,

Barry

On Tue, Oct 26, 2010 at 8:23 AM, Gart Davis <gdavis at spoonflower.com> wrote:

> This is -very- helpful.
>
> So, if I understand you properly, I should focus on scaling -inside-
> my EBS devices first.
>
> What I should really do is create a gluster volume that starts with
> -lots- of 125 GB EBS devices (in my case, 32 to achieve 2 TB of usable
> replicated storage).  I should rsync -a to this volume to ensure a
> roughly even distribution/replication of files.  As the fullest EBS
> devices get to 80%, using snapshot/restore techniques, replace them
> with 250 GB EBS devices.  Next time 500 GB, next time 1 TB.  Then start
> over again with 512 125 GB EBS devices and another rsync -a, and
> repeat.
>
> Because Gluster has no central metadata server, this should in theory
> scale to the horizon, with a quick scriptable upgrade every doubling,
> and one painful multi-day transition using rsync -a every 10x.
>
> Does this make sense?  What are the gotchas with this approach?
>
> Thanks for your insights on this!
>
> Gart
>
> On Mon, Oct 25, 2010 at 7:25 PM, Barry Jaspan <barry.jaspan at acquia.com>
> wrote:
> > Gart,
> >
> > I was speaking generally in my message because I did not know anything
> > about your actual situation (maybe because I did not read carefully).
> > From this message, I understand your goal to be: You have a "source EBS
> > volume" that you would like to replace with a gluster filesystem
> > containing the same data. Based on this, my personal recommendation
> > (which carries no official weight whatsoever) is:
> >
> > 1. On your gluster fileservers, mount whatever bricks you want. It
> > sounds like you want cluster/distribute over two cluster/replicate
> > volumes over two 1TB EBS volumes each, so put two 1TB bricks on each
> > server and export them.
> >
> > 2. From the machine holding the source EBS volume, mount the gluster
> > bricks created in step 1 under a volfile that arranges them under
> > cluster/distribute and cluster/replicate as you wish (see the volfile
> > sketch after step 5).
> >
> > 3. rsync -a /source-ebs /mnt/gfs
> >
> > 4. Switch your production service to use /mnt/gfs.
> >
> > 5. rsync -a /source-ebs /mnt/gfs again to catch any stragglers. The
> > actual details of when/how to run rsync, whether to take down
> > production, etc. depend on your service, of course.
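> >
> > For step 2, a hand-written client volfile could look roughly like the
> > sketch below. This is only an illustration: the hostnames (server1,
> > server2) and the exported brick names (brick1, brick2) are hypothetical
> > and must match whatever your server volfiles actually export.
> >
> >     # one protocol/client volume per exported brick
> >     volume s1-brick1
> >       type protocol/client
> >       option transport-type tcp
> >       option remote-host server1
> >       option remote-subvolume brick1
> >     end-volume
> >
> >     volume s2-brick1
> >       type protocol/client
> >       option transport-type tcp
> >       option remote-host server2
> >       option remote-subvolume brick1
> >     end-volume
> >
> >     volume s1-brick2
> >       type protocol/client
> >       option transport-type tcp
> >       option remote-host server1
> >       option remote-subvolume brick2
> >     end-volume
> >
> >     volume s2-brick2
> >       type protocol/client
> >       option transport-type tcp
> >       option remote-host server2
> >       option remote-subvolume brick2
> >     end-volume
> >
> >     # each replica pair spans both servers
> >     volume replicate1
> >       type cluster/replicate
> >       subvolumes s1-brick1 s2-brick1
> >     end-volume
> >
> >     volume replicate2
> >       type cluster/replicate
> >       subvolumes s1-brick2 s2-brick2
> >     end-volume
> >
> >     # hash files across the two replica pairs
> >     volume distribute
> >       type cluster/distribute
> >       subvolumes replicate1 replicate2
> >     end-volume
> >
> > Mounting it is then roughly "glusterfs -f client.vol /mnt/gfs".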
> >
> > On Mon, Oct 25, 2010 at 2:13 PM, Gart Davis <gdavis at spoonflower.com>
> > wrote:
> >>
> >> My principal concerns with this relate to Barry's 3rd bullet: Gluster
> >> does not rebalance evenly, and so this solution will eventually bounce
> >> off the roof and lock up.
> >
> > We had a replicate volume. We added distribute on top of it, added a
> > subvolume (which was another replicate volume), and used gluster's
> > "rebalance" script, which consists of removing certain extended
> > attributes, renaming files, and copying them back into place. The end
> > result was that not very much data got moved to the new volume. Also,
> > that approach to rebalancing has inherent race conditions. The best you
> > can do to add more storage space to an existing volume is to set your
> > min-free-disk low enough (perhaps 80%) so that each time a new file is
> > added that would otherwise go to the old, full brick, gluster will
> > instead create a link file on the old brick pointing to the new brick
> > and put the real data on the new brick. This imposes extra
> > link-following overhead, but I believe it works.
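> >
> > The knob described here is the distribute translator's min-free-disk
> > option. It is expressed as the minimum *free* space to keep on a brick:
> > a new file that would land on a brick below that threshold gets only a
> > link file there, and its data goes to another brick. In a hand-written
> > volfile it would look something like this (value purely illustrative):
> >
> >     volume distribute
> >       type cluster/distribute
> >       # divert new files away from any brick with less free space
> >       # than this
> >       option min-free-disk 20%
> >       subvolumes replicate1 replicate2
> >     end-volume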
> >
> >> Forgive my naivete Barry, when you say 'just use larger replicate
> >> volumes instead of distribute', what does that mean?
> >
> > After our fiasco trying to switch from a single replicate volume to
> > distribute over two replicates (having all the problems I just
> > described), we just went back to a single replicate volume and increased
> > our EBS volume sizes. They were only 100GB, and we made them 500GB. This
> > worked because EBS allows it. If/when we need the bricks to be bigger
> > than 1TB... well, I hope gluster has improved its capabilities by that
> > point. If not, we might use LVM or whatever on the glusterfs server to
> > make multiple EBS volumes look like >1TB bricks.
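> >
> > The LVM route would be the usual stack, roughly the following sketch
> > (device names, volume group, and brick path are hypothetical):
> >
> >     pvcreate /dev/xvdf /dev/xvdg              # two 1TB EBS devices
> >     vgcreate gluster_vg /dev/xvdf /dev/xvdg
> >     lvcreate -l 100%FREE -n brick1 gluster_vg # one ~2TB logical volume
> >     mkfs.xfs /dev/gluster_vg/brick1
> >     mount /dev/gluster_vg/brick1 /ebs/brick1  # export this as the brick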
> >
> > Barry
> >
> >>
> >> Are you running
> >> multiple 1 TB EBS bricks in a single 'replica 2' volume under a single
> >> file server?  My recipe is largely riffing off Josh's tutorial.
> >> You've clearly found a recipe that you're happy to entrust production
> >> data to... how would you change this?
> >>
> >> Thanks!
> >>
> >> Gart
> >> _______________________________________________
> >> Gluster-users mailing list
> >> Gluster-users at gluster.org
> >> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
> >
> >
> >
> > --
> > Barry Jaspan
> > Senior Architect | Acquia
> > barry.jaspan at acquia.com | (c) 617.905.2208 | (w) 978.296.5231
> >
> > "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"
> >
> >
>



-- 
Barry Jaspan
Senior Architect | Acquia <http://acquia.com>
barry.jaspan at acquia.com | (c) 617.905.2208 | (w) 978.296.5231

"Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"

