[Gluster-users] Replica 3 scale out and ZFS bricks

Thu Sep 17 10:15:54 UTC 2020

On 9/16/20 9:53 PM, Strahil Nikolov wrote:
> On Wednesday, 16 September 2020, 11:54:57 GMT+3, Alexander Iliev <ailiev+gluster at mamul.org> wrote:
> 
>  From what I understood, in order to be able to scale it one node at a
> time, I need to set up the initial nodes with a number of bricks that is
> a multiple of 3 (e.g., 3, 6, 9, etc. bricks). The initial cluster will
> be able to export a volume as large as the storage of a single node and
> adding one more node will grow the volume by 1/3 (assuming homogeneous
> nodes.)
> 
>      You can't add 1 node to a replica 3, so no - you won't get 1/3 with that extra node.

OK, then I guess I was totally confused on this point.

I'd imagined something like this would work:

   node1        node2        node3
+---------+  +---------+  +---------+
| brick 1 |  | brick 1 |  | brick 1 |
| brick 2 |  | brick 2 |  | brick 2 |
| brick 3 |  | brick 3 |  | brick 3 |
+---------+  +---------+  +---------+
                  |
                  v
   node1        node2        node3        node4
+---------+  +---------+  +---------+  +---------+
| brick 1 |  | brick 1 |  | brick 4 |  | brick 1 |
| brick 2 |  | brick 4 |  | brick 2 |  | brick 2 |
| brick 3 |  | brick 3 |  | brick 3 |  | brick 4 |
+---------+  +---------+  +---------+  +---------+

any# gluster peer probe node4
any# gluster volume replace-brick volume1 node2:/gfs/2/brick node4:/gfs/2/brick commit force
any# gluster volume replace-brick volume1 node3:/gfs/1/brick node4:/gfs/1/brick commit force
node2# umount /gfs/2 && mkfs /dev/... && mv /gfs/2 /gfs/4 && mount /dev/... /gfs/4 # or clean up the replaced brick by other means
node3# umount /gfs/1 && mkfs /dev/... && mv /gfs/1 /gfs/4 && mount /dev/... /gfs/4 # or clean up the replaced brick by other means
any# gluster volume add-brick volume1 node2:/gfs/4/brick node3:/gfs/4/brick node4:/gfs/4/brick

(Note: /etc/fstab or whatever mounting mechanism is used also needs to 
be updated after renaming the mount-points on node2 and node3.)
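
The sequence above leaves out two follow-up steps that are probably worth doing: checking that self-heal has caught up after each replace-brick (before wiping the old brick), and rebalancing after add-brick so that existing data spreads onto the new replica set. A minimal sketch of those checks, assuming the volume name volume1 from the commands above. The DRY_RUN switch and the run helper are illustrative additions, and by default the script only prints the commands instead of running them:

```shell
#!/bin/sh
# Follow-up checks for the brick shuffle above (a sketch, not a tested procedure).
# VOLUME matches the commands above; DRY_RUN and run() are illustrative helpers.
VOLUME="${VOLUME:-volume1}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

followup_checks() {
    # After each replace-brick: list entries that still need healing.
    run gluster volume heal "$VOLUME" info
    # After add-brick: confirm the new replica set shows up in the brick list.
    run gluster volume info "$VOLUME"
    # Spread existing data onto the new replica set and watch progress.
    run gluster volume rebalance "$VOLUME" start
    run gluster volume rebalance "$VOLUME" status
}

followup_checks
```

Running it with DRY_RUN=0 on any node of the trusted pool would execute the commands for real.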

I played around with this in a VM setup and it seems to work, but maybe 
I'm missing something.

Even if this is supposed to work, it may have other implications I'm not
aware of, so I would be happy to be educated on this.
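
The target layout in the diagram can at least be sanity-checked on paper: every replica set should end up with its three bricks on three different nodes, and the usable capacity is one brick per replica set. A small sketch of that check follows. The layout table is transcribed from the diagram, while check_layout and usable_bricks are made-up helper names:

```shell
#!/bin/sh
# Target layout from the diagram: one line per replica set, followed by the
# three nodes that hold its bricks after the move.
layout='brick1 node1 node2 node4
brick2 node1 node3 node4
brick3 node1 node2 node3
brick4 node2 node3 node4'

check_layout() {
    # Succeeds only if every replica set lists exactly 3 distinct nodes.
    printf '%s\n' "$1" | awk '
        { if (NF != 4 || $2 == $3 || $2 == $4 || $3 == $4) bad = 1 }
        END { exit (bad + 0) }'
}

usable_bricks() {
    # With replica 3, each replica set contributes one brick of usable space.
    printf '%s\n' "$1" | awk 'NF { n++ } END { print n + 0 }'
}

if check_layout "$layout"; then
    echo "layout ok, usable capacity: $(usable_bricks "$layout") x brick size"
fi
```

With three replica sets before the move and four after it, the volume grows by one brick, i.e. by 1/3 when all bricks are the same size.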

> 
> My plan is to use ZFS as the underlying system for the bricks. Now I'm
> wondering - if I join the disks on each node in a, say, RAIDZ2 pool and
> then create a dataset within the pool for each brick, the GlusterFS
> volume would report the volume size 3x$brick_size, because each brick
> shares the same pool and the size/free space is reported according to
> the ZFS pool size/free space.
> 
> I'm not sure about ZFS (never played with it on Linux), but on my systems I set up a thin pool consisting of all HDDs in a striped way (when no hardware RAID controller is available) and then set up thin LVs for each brick.
> In thin LVM you can define a virtual size, and this size is reported as the volume size (assuming that all bricks are the same size). If you have 1 RAIDZ2 pool per Gluster TSP node, then that pool's size is the maximum size of your volume. If you plan to use snapshots, then you should set a quota on the volume to control the usage.
> 
> How should I go about this? Should I create a ZFS pool per brick (this
> seems to have a negative impact on performance)? Should I set a quota
> for each dataset?
> 
> I would go with 1 RAIDZ2 pool with 1 dataset of type 'filesystem' per Gluster node. Quota is always good to have.
> 
> 
> P.S.: Any reason to use ZFS? It uses a lot of memory.

Two main reasons for ZFS - node-level redundancy and compression.

I want to enable some node-level fault tolerance in order to avoid 
healing a failed node from scratch. From my experience so far healing 
(at least in our environment) is quite slow and painful.

Hardware RAID is not an option in our setup. With LVM mirroring we would 
be utilizing 50% of the physical space. We could go with mdadm+LVM, but 
it feels messier and AFAIK mdadm RAID6 is prone to the "write hole" 
problem (but maybe I'm outdated on this one).
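
For reference, the single-pool variant discussed above (one RAIDZ2 pool per node, one dataset per brick, a quota on each dataset) could look roughly like the sketch below. The pool name, disk names and quota size are made-up placeholders, and the DRY_RUN switch and run helper are illustrative additions, so by default the script only prints the commands. With a quota set, each dataset should report its quota rather than the whole pool as its size, which is what keeps the brick sizes from being counted several times:

```shell
#!/bin/sh
# Sketch: one RAIDZ2 pool per node, one quota-limited dataset per brick.
# POOL, DISKS and BRICK_QUOTA are hypothetical placeholders, not values from
# this thread. DRY_RUN=1 (the default) only prints the commands.
POOL="${POOL:-tank}"
DISKS="${DISKS:-/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg}"
BRICK_QUOTA="${BRICK_QUOTA:-1T}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

setup_node() {
    # $DISKS is intentionally unquoted so it splits into one argument per disk.
    run zpool create "$POOL" raidz2 $DISKS
    # Compression is set on the pool's root dataset and inherited by the bricks.
    run zfs set compression=lz4 "$POOL"
    # Three bricks per node, mounted where the gluster commands expect them.
    for i in 1 2 3; do
        run zfs create -o mountpoint="/gfs/$i" -o quota="$BRICK_QUOTA" "$POOL/brick$i"
    done
}

setup_node
```

Note that a ZFS quota also counts space held by snapshots of the dataset; the refquota property limits only the live data, which may be the better fit if snapshots are planned.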

Best regards,
--
alexander iliev