[Gluster-users] Best practices?
Greg_Swift at aotx.uscourts.gov
Wed Jan 25 15:14:40 UTC 2012
gluster-users-bounces at gluster.org wrote on 01/24/2012 10:10:15 AM:
>
> On Tue, Jan 24, 2012 at 09:11:01AM -0600, Greg_Swift at aotx.uscourts.gov wrote:
> > We have to have large numbers of volumes (~200). Quick run down to give
> > context.
> >
> > Our nodes would have around 128TB of local storage from several 32TB raid
> > sets. We started with ext4, so had a 16TB maximum.
>
> Aside: http://blog.ronnyegner-consulting.de/2011/08/18/ext4-and-the-16-tb-limit-now-solved/
kewl
> > So we broke it down
> > into nice even chunks of 16TB, thus 8 file systems. Our first attempt was
> > ~200 volumes all using the 8 bricks per node (thus 1600 process/ports)
> ...
> > We had issues, and Gluster recommended
> > reducing our process/port count.
>
> So just checking I understand, the original configuration was:
>
> /data1/vol1 .. /data1/vol200
> ...
> /data8/vol1 .. /data8/vol200
correct
>
> Terminology issue: isn't each serverN:/dirM considered a separate 'brick' to
> Gluster? I would have thought that configuration would count as 1600 bricks
> per node (but with groups of 200 bricks sharing 1 underlying filesystem)
You are right... I should have said 1600 bricks/processes/ports.
For the sake of the conversation, I think this expanded brick definition works:
Brick: A unique file system on a single storage node that runs a single
glusterfsd process and opens a single listening TCP port (1 brick = 1
process = 1 port). A combination of 1 or more bricks comprises a volume.
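To put that in concrete terms, here's a minimal sketch (hostnames and brick
paths are made up for illustration; two nodes, one brick per node for the
volume):

  # each brick listed below gets its own glusterfsd process and its own
  # listening TCP port on the server that hosts it
  gluster volume create vol1 server1:/data1/vol1 server2:/data1/vol1
  gluster volume start vol1

  # on each server: expect one glusterfsd per brick hosted there
  ps ax | grep glusterfsd

With 200 volumes each using 8 bricks per node, that works out to the 1600
glusterfsd processes/ports per node mentioned above.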
>
> > First we dropped down to only using 1 brick per volume per node, but this
> > left us in a scenario of managing growth
>
> Like this?
>
> /data1/vol1 .. /data1/vol25
> /data2/vol26 .. /data2/vol50
> ...
> /data8/vol175 .. /data8/vol200
yes
> I see, so you have to assign the right subset of volumes to each filesystem.
> I guess you could shuffle them around using replace-brick, but it would be a
> pain.
very much so
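For anyone following along, the shuffle would look roughly like this per
volume (volume, server and path names here are hypothetical), which is why
doing it a couple hundred times gets painful:

  # move vol26's brick from the /data2 filesystem to /data1 on the same server
  gluster volume replace-brick vol26 server1:/data2/vol26 server1:/data1/vol26 start
  # poll until the data migration finishes, then commit
  gluster volume replace-brick vol26 server1:/data2/vol26 server1:/data1/vol26 status
  gluster volume replace-brick vol26 server1:/data2/vol26 server1:/data1/vol26 commit

Then repeat for every volume (and every replica of it) that needs to move.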
> > So we determined to move to XFS to reduce from 8 partitions
> > down to 2 LVs, each 64TB
>
> /data1/vol1 .. /data1/vol200
> /data2/vol1 .. /data2/vol200
>
> i.e. 400 ports/processes/(bricks?) per server.
That was the plan
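The LVM side of that consolidation is ordinary stuff; roughly something like
this, assuming four 32TB raid sets per node (device, VG and LV names below are
made up), with mkfs.xfs on each LV afterwards (tuning covered further down):

  pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
  vgcreate vg_gluster /dev/sdb /dev/sdc /dev/sdd /dev/sde
  # two 64TB logical volumes instead of eight 16TB partitions
  lvcreate -l 50%VG -n brick1 vg_gluster
  lvcreate -l 100%FREE -n brick2 vg_gluster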
> > We then ran into some performance
> > issues and found we had not tuned the XFS enough, which also deterred us
> > from pushing forward with the move.
>
> I don't have any experience with XFS, but the Gluster docs do recommend it
> as the one most heavily tested.
>
> I saw an old note here about tuning XFS to include extended attributes in
> the inode:
> http://www.gluster.org/community/documentation/index.php/Guide_to_Optimizing_GlusterFS
> (although the values shown seem to be defaults to mkfs.xfs nowadays)
>
> Did you find any other tuning was required?
We didn't do much tuning up front, which may have been part of the problem.
After the fact we did add some tuning, such as disabling barriers, atime and
diratime.
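For reference, that kind of tuning looks roughly like this (device name and
mount point are placeholders; -i size=512 is the xattr-in-the-inode setting
from the guide linked above, which may already be the mkfs.xfs default, and
nobarrier is only sensible with a battery-backed RAID cache):

  # larger inodes so Gluster's extended attributes stay inside the inode
  mkfs.xfs -i size=512 /dev/vg_gluster/brick1

  # /etc/fstab entry: skip access-time updates and write barriers
  /dev/vg_gluster/brick1  /data1  xfs  noatime,nodiratime,nobarrier  0 0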
> This is all extremely helpful - many thanks for sharing your experiences.
>
> BTW I am just in the process of setting up two test systems here. Somewhat
> smaller than yours, but they are based on this chassis:
> http://www.xcase.co.uk/24-bay-Hotswap-rackmount-chassis-norco-RPC-4224-p/case-xcase-rm424.htm
> with Hitachi low-power 3TB drives.
>
that's pretty kewl... too bad it doesn't do SFF hard drives... that would be
awesome.
-greg