[Gluster-users] Gluster usage scenarios in HPC cluster management

Zeeshan Ali Shah javaclinic at gmail.com
Wed Mar 24 05:18:12 UTC 2021


Just to add on: we are using Gluster alongside our main Lustre storage for our
k8s cluster.

On Wed, Mar 24, 2021 at 4:33 AM Ewen Chan <alpha754293 at hotmail.com> wrote:

> Erik:
>
> I just want to say that I really appreciate you sharing this information
> with us.
>
> I don't think my personal home lab micro-cluster environment will ever get
> complicated enough that I need a virtualized testing/Gluster development
> setup like you have, but on the other hand, as I mentioned before, I am
> running 100 Gbps InfiniBand, so what I am using Gluster for is quite
> different from how most people deploy/install Gluster for production
> systems.
>
> If I wanted to splurge, I'd get a second set of IB cables so that the
> high-speed interconnect layer could be split, with jobs running on one
> layer of the InfiniBand fabric whilst storage/Gluster runs on another.
>
> But for that, I'll have to revamp my entire microcluster, so there are no
> plans to do that just yet.
>
> Thank you.
>
> Sincerely,
> Ewen
>
> ------------------------------
> *From:* gluster-users-bounces at gluster.org <
> gluster-users-bounces at gluster.org> on behalf of Erik Jacobson <
> erik.jacobson at hpe.com>
> *Sent:* March 23, 2021 10:43 AM
> *To:* Diego Zuccato <diego.zuccato at unibo.it>
> *Cc:* gluster-users at gluster.org <gluster-users at gluster.org>
> *Subject:* Re: [Gluster-users] Gluster usage scenarios in HPC cluster
> management
>
> > I still have to grasp the "leader node" concept.
> > Weren't gluster nodes "peers"? Or by "leader" you mean that it's
> > mentioned in the fstab entry like
> > /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0
> > while the peer list includes l1,l2,l3 and a bunch of other nodes?
>
> Right, it's a list of 24 peers. The 24 peers are split into a 3x24
> replicated/distributed setup for the volumes. They also have entries
> for themselves as clients in /etc/fstab. I'll dump some volume info
> at the end of this.
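>
> As a rough sketch, a client mount entry in /etc/fstab for one of those
> peers might look something like this (hostnames, mount point, and options
> here are made up for illustration, not copied from the real cluster):
>
>   leader1:/cm_shared  /gluster/cm_shared  glusterfs  defaults,_netdev,backup-volfile-servers=leader2:leader3  0 0
>
> The backup-volfile-servers option just gives the client other peers to
> fetch the volfile from if the first one is unreachable.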
>
>
> > > So we would have 24 leader nodes, each leader would have a disk serving
> > > 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
> > > one is for logs, and one is heavily optimized for non-object expanded
> > > tree NFS). The term "disk" is loose.
> > That's a system way bigger than ours (3 nodes, replica3arbiter1, up to
> > 36 bricks per node).
>
> I have one dedicated "disk" (could be a disk, a RAID LUN, or a single SSD)
> and 4 directories for volumes ("bricks"). Of course, the "ctdb" volume is
> just for the lock and has a single file.
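>
> On a leader that ends up looking roughly like the following (only the
> brick_cm_obj_sharded path is real; the other directory names are
> placeholders for the volumes described above):
>
>   /data/brick_ctdb             # lock FS for CTDB, a single file
>   /data/brick_cm_obj_sharded   # sharded image-object volume
>   /data/brick_cm_logs          # log volume
>   /data/brick_cm_shared        # volume tuned for expanded-tree NFS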
>
> >
> > > Specs of a leader node at a customer site:
> > >  * 256G RAM
> > Glip! 256G for 4 bricks... No wonder I have had troubles running 26
> > bricks in 64GB RAM... :)
>
> I'm not an expert in memory pools or how they would be impacted by more
> peers. I had to do a little research, and I think what you're after is
> whether I can run "gluster volume status cm_shared mem" on a real cluster
> that has a decent node count. I will see if I can do that.
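>
> For reference, that is just:
>
>   gluster volume status cm_shared mem
>
> which prints mallinfo-style memory stats for each brick process. The
> "Memory status" dump for cm_obj_sharded further down is the same command
> run against that volume.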
>
>
> TEST ENV INFO for those who care
> --------------------------------
> Here is some info on my own test environment, which you can skip.
>
> I have the environment duplicated on my desktop using virtual machines and
> it runs fine (slow but fine). It's a 3x1. I take out my giant 8GB cache
> from the optimized volumes but other than that it is fine. In my
> development environment, the gluster disk is a 40G qcow2 image.
>
> Cache sizes changed from 8G to 100M to fit in the VM.
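>
> That's just the performance.cache-size volume option (the 8GB value shows
> up in the volume info below), e.g., using the sharded volume as an example:
>
>   gluster volume set cm_obj_sharded performance.cache-size 100MB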
>
> XML snips for memory, cpus:
> <domain type='kvm' id='24'>
>   <name>cm-leader1</name>
>   <uuid>99d5a8fc-a32c-b181-2f1a-2929b29c3953</uuid>
>   <memory unit='KiB'>3268608</memory>
>   <currentMemory unit='KiB'>3268608</currentMemory>
>   <vcpu placement='static'>2</vcpu>
>   <resource>
> ......
>
>
> I have 1 admin (head) node VM, 3 VM leader nodes like above, and one test
> compute node for my development environment.
>
> My desktop where I test this cluster stack is a beefy but not brand new
> desktop:
>
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> Address sizes:       46 bits physical, 48 bits virtual
> CPU(s):              16
> On-line CPU(s) list: 0-15
> Thread(s) per core:  2
> Core(s) per socket:  8
> Socket(s):           1
> NUMA node(s):        1
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               79
> Model name:          Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
> Stepping:            1
> CPU MHz:             2594.333
> CPU max MHz:         3000.0000
> CPU min MHz:         1200.0000
> BogoMIPS:            4190.22
> Virtualization:      VT-x
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            256K
> L3 cache:            20480K
> NUMA node0 CPU(s):   0-15
> <SNIP>
>
>
> (Not that it matters, but this is an HP Z640 Workstation.)
>
> 128G memory (a lot for a desktop, I know, but I think 64G would work; I
> also run a Windows 10 VM environment for unrelated reasons)
>
> I was able to find a MegaRAID in the lab a few years ago, so I have 4
> drives in a MegaRAID and carve off a separate volume for the VM disk
> images. It has a cache, so that's also beefier than a normal desktop.
> (On the other hand, I have no SSDs. I may experiment with that some day,
> but things work so well now that I'm tempted to leave it until something
> croaks. :)
>
> I keep all VMs for the test cluster with "Unsafe cache mode" since there
> is no true data to worry about and it makes the test cases faster.
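>
> (That's the cache='unsafe' attribute on the disk <driver> element in the
> libvirt domain XML; something along these lines, with the image path just
> as an example:)
>
>   <disk type='file' device='disk'>
>     <driver name='qemu' type='qcow2' cache='unsafe'/>
>     <source file='/var/lib/libvirt/images/cm-leader1.qcow2'/>
>     <target dev='vda' bus='virtio'/>
>   </disk>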
>
> So I am able to test a complete cluster management stack, including
> 3 gluster leader servers, an admin node, and a compute node, all on my
> desktop using virtual machines and shared networks within libvirt/qemu.
>
> It is so much easier to do development when you don't have to reserve
> scarce test clusters and compete with people. I can do 90% of my cluster
> development work this way. Things fall over when I need to care about
> BMCs/ILOs or need to do performance testing of course. Then I move to
> real hardware and play the hunger-games-of-internal-test-resources :) :)
>
> I mention all this just to show that beefy servers are not needed and that
> the memory usage is not high. I'm not continually swapping or anything like
> that.
>
>
>
>
> Configuration Info from Real Machine
> ------------------------------------
>
> Some info on an active 3x3 cluster. 2738 compute nodes.
>
> The most active volume here is "cm_obj_sharded". It is where the image
> objects live, and this cluster uses image objects for the compute node root
> filesystems. I changed the IP addresses by hand (mentioning that in case I
> made an error doing so).
>
>
> Memory status for volume : cm_obj_sharded
> ----------------------------------------------
> Brick : 10.1.0.5:/data/brick_cm_obj_sharded
> Mallinfo
> --------
> Arena    : 20676608
> Ordblks  : 2077
> Smblks   : 518
> Hblks    : 17
> Hblkhd   : 17350656
> Usmblks  : 0
> Fsmblks  : 53728
> Uordblks : 5223376
> Fordblks : 15453232
> Keepcost : 127616
>
> ----------------------------------------------
> Brick : 10.1.0.6:/data/brick_cm_obj_sharded
> Mallinfo
> --------
> Arena    : 21409792
> Ordblks  : 2424
> Smblks   : 604
> Hblks    : 17
> Hblkhd   : 17350656
> Usmblks  : 0
> Fsmblks  : 62304
> Uordblks : 5468096
> Fordblks : 15941696
> Keepcost : 127616
>
> ----------------------------------------------
> Brick : 10.1.0.7:/data/brick_cm_obj_sharded
> Mallinfo
> --------
> Arena    : 24240128
> Ordblks  : 2471
> Smblks   : 563
> Hblks    : 17
> Hblkhd   : 17350656
> Usmblks  : 0
> Fsmblks  : 58832
> Uordblks : 5565360
> Fordblks : 18674768
> Keepcost : 127616
>
> ----------------------------------------------
> Brick : 10.1.0.8:/data/brick_cm_obj_sharded
> Mallinfo
> --------
> Arena    : 22454272
> Ordblks  : 2575
> Smblks   : 528
> Hblks    : 17
> Hblkhd   : 17350656
> Usmblks  : 0
> Fsmblks  : 53920
> Uordblks : 5583712
> Fordblks : 16870560
> Keepcost : 127616
>
> ----------------------------------------------
> Brick : 10.1.0.9:/data/brick_cm_obj_sharded
> Mallinfo
> --------
> Arena    : 22835200
> Ordblks  : 2493
> Smblks   : 570
> Hblks    : 17
> Hblkhd   : 17350656
> Usmblks  : 0
> Fsmblks  : 59728
> Uordblks : 5424992
> Fordblks : 17410208
> Keepcost : 127616
>
> ----------------------------------------------
> Brick : 10.1.0.10:/data/brick_cm_obj_sharded
> Mallinfo
> --------
> Arena    : 23085056
> Ordblks  : 2717
> Smblks   : 697
> Hblks    : 17
> Hblkhd   : 17350656
> Usmblks  : 0
> Fsmblks  : 74016
> Uordblks : 5631520
> Fordblks : 17453536
> Keepcost : 127616
>
> ----------------------------------------------
> Brick : 10.1.0.11:/data/brick_cm_obj_sharded
> Mallinfo
> --------
> Arena    : 26537984
> Ordblks  : 3044
> Smblks   : 985
> Hblks    : 17
> Hblkhd   : 17350656
> Usmblks  : 0
> Fsmblks  : 103056
> Uordblks : 5702592
> Fordblks : 20835392
> Keepcost : 127616
>
> ----------------------------------------------
> Brick : 10.1.0.12:/data/brick_cm_obj_sharded
> Mallinfo
> --------
> Arena    : 23556096
> Ordblks  : 2658
> Smblks   : 735
> Hblks    : 17
> Hblkhd   : 17350656
> Usmblks  : 0
> Fsmblks  : 78720
> Uordblks : 5568736
> Fordblks : 17987360
> Keepcost : 127616
>
> ----------------------------------------------
> Brick : 10.1.0.13:/data/brick_cm_obj_sharded
> Mallinfo
> --------
> Arena    : 26050560
> Ordblks  : 3064
> Smblks   : 926
> Hblks    : 17
> Hblkhd   : 17350656
> Usmblks  : 0
> Fsmblks  : 96816
> Uordblks : 5807312
> Fordblks : 20243248
> Keepcost : 127616
>
> ----------------------------------------------
>
>
>
> Volume configuration details for this one:
>
> Volume Name: cm_obj_sharded
> Type: Distributed-Replicate
> Volume ID: 76c30b65-7194-4af2-80f7-bf876f426e5a
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 3 x 3 = 9
> Transport-type: tcp
> Bricks:
> Brick1: 10.1.0.5:/data/brick_cm_obj_sharded
> Brick2: 10.1.0.6:/data/brick_cm_obj_sharded
> Brick3: 10.1.0.7:/data/brick_cm_obj_sharded
> Brick4: 10.1.0.8:/data/brick_cm_obj_sharded
> Brick5: 10.1.0.9:/data/brick_cm_obj_sharded
> Brick6: 10.1.0.10:/data/brick_cm_obj_sharded
> Brick7: 10.1.0.11:/data/brick_cm_obj_sharded
> Brick8: 10.1.0.12:/data/brick_cm_obj_sharded
> Brick9: 10.1.0.13:/data/brick_cm_obj_sharded
> Options Reconfigured:
> nfs.rpc-auth-allow: 10.1.*
> auth.allow: 10.1.*
> performance.client-io-threads: on
> nfs.disable: off
> storage.fips-mode-rchecksum: on
> transport.address-family: inet
> performance.cache-size: 8GB
> performance.flush-behind: on
> performance.cache-refresh-timeout: 60
> performance.nfs.io-cache: on
> nfs.nlm: off
> nfs.export-volumes: on
> nfs.export-dirs: on
> nfs.exports-auth-enable: on
> transport.listen-backlog: 16384
> nfs.mount-rmtab: /-
> performance.io-thread-count: 32
> server.event-threads: 32
> nfs.auth-refresh-interval-sec: 360
> nfs.auth-cache-ttl-sec: 360
> features.shard: on
>
>
>
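> For reference, a volume shaped like this could be created with something
> along these lines (sketch only; bricks as listed above, tuning options
> omitted):
>
>   gluster volume create cm_obj_sharded replica 3 \
>       10.1.0.5:/data/brick_cm_obj_sharded \
>       10.1.0.6:/data/brick_cm_obj_sharded \
>       10.1.0.7:/data/brick_cm_obj_sharded \
>       10.1.0.8:/data/brick_cm_obj_sharded \
>       10.1.0.9:/data/brick_cm_obj_sharded \
>       10.1.0.10:/data/brick_cm_obj_sharded \
>       10.1.0.11:/data/brick_cm_obj_sharded \
>       10.1.0.12:/data/brick_cm_obj_sharded \
>       10.1.0.13:/data/brick_cm_obj_sharded
>   gluster volume set cm_obj_sharded features.shard on
>   gluster volume start cm_obj_sharded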
>
> There are 3 other volumes (this is the only sharded one). I can provide
> more info if desired.
>
> Typical boot times for 3k nodes and 9 leaders, ignoring BIOS setup time,
> are 2-5 minutes. The power of the image objects is what makes that fast.
> An expanded-tree (traditional) NFS export, where the whole directory tree
> is exported and used file by file, would be more like 9-12 minutes.
>
>
> Erik
> ________
>
>
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users