[Gluster-users] Does gluster make use of a multicore setup? Hardware recs.?
lslusser at gmail.com
Wed Apr 27 10:12:31 UTC 2011
Gluster is threaded and will take advantage of multiple CPUs and plenty of
memory, but having a fast disk subsystem is far more important. Lots of
memory with huge bricks isn't very necessary IMO, because even with 32 GB of
RAM your cache hit ratio across huge 30+ TB bricks is so insanely small that
it makes no real-world difference.
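To put a rough number on that (generously assuming all 32 GB were available as page cache, and using decimal units for simplicity):

```shell
# Back-of-the-envelope: what fraction of one 30 TB brick could
# 32 GB of RAM cache at best?
ram_gb=32
brick_tb=30
awk -v ram="$ram_gb" -v brick="$brick_tb" \
    'BEGIN { printf "cache covers %.3f%% of one brick\n", ram / (brick * 1024) * 100 }'
# -> cache covers 0.104% of one brick
```

With random reads spread over the whole brick, a ~0.1% cache footprint means almost every request goes to disk anyway.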
I have a 240tb cluster over 16 bricks (4 physical servers each
exporting 4 30tb bricks) and another 120tb cluster over 8 bricks (4
physical servers each exporting 2 30tb bricks).
Hardware-wise, both my clusters are basically the same: Supermicro
SC846E1-R900B 4U 24-drive chassis, dual 2.2 GHz quad-core Xeons, 8 GB RAM, a
3ware 9690SA-4I4E SAS RAID controller, and 24 Seagate 1.5 TB 7200 RPM SATA
drives in each chassis. Each brick is a RAID6 over all 24 drives in a
chassis. I daisy-chain the chassis together via SAS cables: on my larger
cluster I daisy-chain 3 more 24-drive chassis off the back of the head node,
and on the smaller cluster I only daisy-chain one chassis off the back.
Lots of people prefer not to do RAID and have Gluster handle file
replication (replicate) between bricks. My problem with that is that with a
huge number of files (I have nearly 100 million files on my larger cluster),
a rebuild (triggered with ls -alR) takes 2-3 weeks. And since those Seagate
drives are crap (I lose maybe 1-3 drives a month!) I would be rebuilding
almost constantly. Using hardware RAID makes life much easier for me:
instead of 384 bricks I have 16. When I lose a drive I just hot-swap it and
let the 3ware controller rebuild the RAID6 array. Rebuild time on the 3ware
depends on the workload, but it's anywhere from 2-5 days normally. One time
I lost a drive in the middle of a rebuild (so one failed and one in a
rebuild state), was able to hot-swap the newly failed drive, and it
correctly rebuilt the array from two failed drives without any problems or
downtime on the cluster. Win!
So I'm a big fan of hardware RAID, especially the 3ware controllers. They
handle the slow non-enterprise Seagate drives very well. I've tried LSI,
Dell PERC 5/E and 6/E, and Supermicro (LSI) controllers, and they all had
issues with drive timeouts. A few recommendations when using the 3ware
controllers:

- Disable smartd in Linux (it pisses off the 3ware controller, and the
  controller keeps an eye on SMART for each disk anyway).
- Set the block readahead in Linux to 16384
  (/sbin/blockdev --setra 16384 /dev/sdX).
- Upgrade the controller firmware to the newest version from 3ware.
- Use the newest 3ware drivers, not the driver bundled with whatever Linux
  distro you use.
- Spend the $100 and get the optional battery backup module for the
  controller.
- Use Nagios to check your RAID status!

Oh, and if you use desktop commodity hard drives, make sure you have a
bunch of spares on hand. :)
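For reference, the host-side pieces of that tuning might look roughly like this on a RHEL/CentOS-era box (the service/chkconfig invocations are assumptions for that distro family; substitute your actual exported 3ware unit device for /dev/sdX):

```shell
# Stop smartd polling the disks behind the 3ware controller;
# the controller monitors SMART on each disk itself.
service smartd stop
chkconfig smartd off

# Bump block-device readahead to 16384 sectors on the RAID unit.
# Run once per exported unit, and persist it (e.g. in rc.local),
# since the setting does not survive a reboot.
/sbin/blockdev --setra 16384 /dev/sdX
/sbin/blockdev --getra /dev/sdX   # verify the new value

# Firmware/driver upgrades, the BBU module, and the Nagios RAID
# check are separate manual steps; 3ware's tw_cli utility can show
# controller and unit status for monitoring, e.g.:
#   tw_cli /c0 show
```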
Even with hardware RAID I still use Gluster's replication for redundancy, so
I can do patches and system maintenance without downtime for my clients. I
mirror bricks between head nodes and then use distribute to glue all the
replicated pairs together.
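With the gluster CLI, a distributed-replicated layout like that is declared by listing the bricks in replica-pair order: each consecutive pair becomes a mirror, and distribute spreads files across the pairs. The hostnames and brick paths below are made up for illustration:

```shell
# Four replica-2 sets; files are distributed across the four mirrors.
gluster volume create bigvol replica 2 \
    server1:/bricks/b1 server2:/bricks/b1 \
    server1:/bricks/b2 server2:/bricks/b2 \
    server3:/bricks/b1 server4:/bricks/b1 \
    server3:/bricks/b2 server4:/bricks/b2
gluster volume start bigvol
```

Brick order matters here: pairing bricks from different physical servers is what lets you take one head node down for maintenance without losing access to data.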
I have two Dell 1950 1U public-facing webservers (Linux/Apache) using the
Gluster FUSE mount, connected via a private backend network to my smaller
cluster. My average file request size is around 3 megs (10-15 requests per
second), and I've been able to push 800 Mbit/sec of HTTP traffic from those
two clients. It might have been higher, but my firewall only has gigabit
Ethernet, which was basically saturated at that point. I only use a 128 MB
Gluster client cache because I'm feeding my CDN, so the requests are very
random and I rarely see two requests for the same file. That's pretty
awesome random-read performance if you ask me, considering the hardware. I
start getting uncomfortable with any more than 600 Mbit/sec of traffic, as
the service read times off the bricks on the Gluster servers start getting
quite high. Those 1.5 TB Seagate drives are cheap, $80 a drive, but they're
not very fast at random reads.
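Sanity-checking those numbers (3 MB average object, decimal units):

```shell
# At 3 MB per request, each hit is ~24 Mbit on the wire.
# 15 req/s sustained works out to ~360 Mbit/s, so saturating
# 800 Mbit/s implies roughly 33 req/s at peak.
awk 'BEGIN {
    mbit_per_req = 3 * 8                       # 24 Mbit per 3 MB object
    printf "sustained: %d Mbit/s at 15 req/s\n", mbit_per_req * 15
    printf "peak: ~%.0f req/s to fill 800 Mbit/s\n", 800 / mbit_per_req
}'
# -> sustained: 360 Mbit/s at 15 req/s
# -> peak: ~33 req/s to fill 800 Mbit/s
```

So the quoted peak implies roughly double the average request rate, which is consistent with bursty CDN-origin traffic.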
Hope that helps!
On Wed, Apr 27, 2011 at 1:39 AM, Martin Schenker
<martin.schenker at profitbricks.com> wrote:
> Hi all!
> I'm new to the Gluster system and tried to find answers to some simple
> questions (and couldn't find the information with Google etc.)
> -does Gluster spread its CPU load across a multicore environment? So does
> it make sense to have 50-core units as Gluster servers? CPU loads seem to go
> up quite high during file system repairs, so spreading / multithreading
> should help? What kind of CPUs are working well? How much does memory help
> performance?
> -Are there any recommendations for commodity hardware? We're thinking of 36
> slot 4U servers, what kind of controllers DO work well for IO speed? Any
> real life experiences? Does it dramatically improve the performance to
> increase the number of controllers per disk?
> The aim is for a ~80-120T file system with 2-3 bricks.
> Thanks for any feedback!
> Best, Martin
> Gluster-users mailing list
> Gluster-users at gluster.org