[Gluster-users] Best practices?

Brian Candler B.Candler at pobox.com
Tue Jan 24 14:04:52 UTC 2012

On Mon, Jan 23, 2012 at 03:54:45PM -0600, Greg_Swift at aotx.uscourts.gov wrote:
> It's been talked about a few times on the list in abstract but I can give
> you one lesson learned from our environment.
> the volume to brick ratio is a sliding scale.  you can have more of
> one, but then you need to have less of the other.

This is interesting, because the examples in the documentation aren't
entirely clear. The section on firewall ports says:


You need one open port, starting at 38465 and incrementing sequentially for
each Gluster storage server, and one port, starting at 24009, for each
brick.  This example opens enough ports for 20 storage servers and three
bricks.

[presumably means three bricks *per server*?]

with this example:

$ iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24011 -j ACCEPT 
$ iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 111 -j ACCEPT 
$ iptables -A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 111 -j ACCEPT 
$ iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 38465:38485 -j ACCEPT
$ service iptables save
$ service iptables restart

So there's one range for bricks (24009 upwards), and one range for servers
(here 38465:38485, which actually allows for 21 servers).
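If I've read that right, the rule generalises to simple arithmetic: with S
servers and B bricks per server (variable names are mine, not Gluster's),
the ranges would be 24009 to 24008+B for bricks and 38465 to 38464+S for
servers. A throwaway check, assuming that reading is correct:

```shell
#!/bin/sh
# Sketch: compute the port ranges implied by the documented scheme.
# SERVERS and BRICKS_PER_SERVER are my own placeholder names.
SERVERS=20
BRICKS_PER_SERVER=3

BRICK_LO=24009
BRICK_HI=$((BRICK_LO + BRICKS_PER_SERVER - 1))
NFS_LO=38465
NFS_HI=$((NFS_LO + SERVERS - 1))

echo "brick ports:  ${BRICK_LO}:${BRICK_HI}"
echo "server ports: ${NFS_LO}:${NFS_HI}"
```

With SERVERS=20 this prints 38465:38484 for the server range, which matches
the observation above that the documented 38465:38485 actually covers 21.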

Now you point out that the number of volumes needs to be considered as well,
which makes sense if each brick can only belong to one volume.

> 24 bricks per node per volume
> 100 volumes
> ---------
> = 2400 running processes and 2400 ports per node

So that's 2400 bricks per node.

It seems to me there are a couple of ways I could achieve this:

(1) the drive mounted on /mnt/sda1 could be exported as 100 bricks,
/mnt/sda1/chunk1 through /mnt/sda1/chunk100, and the same repeated for each
of the other 23 disks in that node.

If I build the first volume comprising all the chunk1's, the second volume
comprising all the chunk2's, and so on, then I'd have 100 volumes across all
the disks. Furthermore, I think this would allow each volume to grow as much
as it wanted, up to the total space available, is that right?
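For concreteness, option (1) could be scripted along these lines. The names
(vol1..vol100, the server/disk paths) are just my placeholders, and the
script only prints the gluster commands rather than running them:

```shell
#!/bin/sh
# Sketch of option (1): each disk carries 100 "chunk" brick directories,
# and volume N is built from chunk N on every disk of every server.
# All names here are hypothetical.
DISKS="sda1 sdb1"          # in reality: all 24 disks
SERVERS="server1 server2"  # a replica pair

i=1
while [ "$i" -le 2 ]; do   # in reality: 1..100 volumes
    bricks=""
    for d in $DISKS; do
        # list replica partners adjacently, as gluster expects
        for s in $SERVERS; do
            bricks="$bricks $s:/mnt/$d/chunk$i"
        done
    done
    echo "gluster volume create vol$i replica 2$bricks"
    i=$((i + 1))
done
```

The brick ordering follows the server1:/disk1 server2:/disk1 ... pattern
used in the replica example further down.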

(2) I could organise the storage on each server into a single RAID block,
and then divide it into 2400 partitions, say 2400 LVM logical volumes.

Then the bricks would have to be of an initial fixed size, and each volume
would not be able to outgrow its allocation without resizing its brick's
filesystems (e.g.  by growing the LVM volumes).  Resizing a volume would be
slow and painful.
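A sketch of what option (2) would involve, with made-up names (bigvg,
/dev/md0, the 10G size) and entirely untested, just to show the shape of it:

```shell
# Sketch of option (2): one big RAID device carved into 2400 fixed-size
# logical volumes, each formatted and mounted as a brick.
vgcreate bigvg /dev/md0
for i in $(seq 1 2400); do
    lvcreate -L 10G -n brick$i bigvg
    mkfs.xfs /dev/bigvg/brick$i
    mkdir -p /bricks/brick$i
    mount /dev/bigvg/brick$i /bricks/brick$i
done
# Growing one gluster volume later means lvextend + xfs_growfs on every
# one of its bricks, on every node.
```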

Neither looks convenient to manage, but (2) seems worse.

> More process/ports means more potential for ports in use, connectivity
> issues, file use limits (ulimits), etc.
> that's not the only thing to keep in mind, but it's a poorly documented one
> that burned me so :)

So if you don't mind me asking, what was your solution? Did you need large
numbers of volumes in your application?

Aside: it will be interesting to see how gluster 3.3's object storage API
handles this (from the documentation, it looks like you can create many
containers within the same volume).

The other concern I have regarding making individual drives be bricks is how
to handle drive failures and replacements.

For example, suppose I have this distributed replicated volume:

  server1:/disk1 server2:/disk1 server1:/disk2 server2:/disk2

Then I notice that server1:/disk1 is about to fail (SMART errors perhaps).
How can I take it out of service? The documentation on shrinking volumes
says:
"When shrinking distributed replicated and distributed striped volumes, you
need to remove a number of bricks that is a multiple of the replica or
stripe count.  For example, to shrink a distributed striped volume with a
stripe count of 2, you need to remove bricks in multiples of 2 (such as 4,
6, 8, etc.).  In addition, the bricks you are trying to remove must be from
the same sub-volume (the same replica or stripe set)."

Obviously I don't want to take out both server1:/disk1 and server2:/disk1,
because I'd lose access to half my data.

So the only other command I can see is replace-brick. This suggests I need
to have at least one spare drive slot (or a hot-spare drive) in the server1
chassis for the replace-brick operation to work onto. And if I do have
/mnt/sda1/chunk1..100, I would have to do replace-brick 100 times.
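To make that concrete, here is roughly the procedure I imagine, based on my
reading of the replace-brick CLI; /mnt/spare is a hypothetical empty drive
in the server1 chassis, and I haven't tested this:

```shell
# Sketch: migrate every chunk off a failing drive onto a spare in the
# same chassis, one replace-brick per volume.
for i in $(seq 1 100); do
    gluster volume replace-brick vol$i \
        server1:/mnt/sda1/chunk$i server1:/mnt/spare/chunk$i start
    # poll "replace-brick ... status" until migration completes, then:
    gluster volume replace-brick vol$i \
        server1:/mnt/sda1/chunk$i server1:/mnt/spare/chunk$i commit
done
```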

Is that correct, or have I misunderstood? Is there some other way to fail or
disable a single brick or drive, whilst still leaving access to its replica?


