<div dir="ltr">Just add on, we are using gluster beside our main storage Lustre for k8s cluster . </div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 24, 2021 at 4:33 AM Ewen Chan <<a href="mailto:alpha754293@hotmail.com">alpha754293@hotmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Erik:</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I just want to say that I really appreciate you sharing this information with us.</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I don't think my personal home lab micro-cluster will ever get complicated enough to warrant a virtualized testing/Gluster development setup like yours. On the other hand, as I mentioned before, I am running 100 Gbps InfiniBand, so what
I am trying to do with Gluster is quite different from what and how most people deploy Gluster for production systems.</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
If I wanted to splurge, I'd get a second set of IB cables so that the high-speed interconnect could be split: jobs would run on one layer of the InfiniBand fabric while storage/Gluster traffic ran on another.</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
But for that, I'll have to revamp my entire microcluster, so there are no plans to do that just yet.</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Thank you.</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Sincerely,</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Ewen<br>
</div>
<div>
<div id="gmail-m_-379120464223181042appendonsend"></div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_-379120464223181042divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> <a href="mailto:gluster-users-bounces@gluster.org" target="_blank">gluster-users-bounces@gluster.org</a> <<a href="mailto:gluster-users-bounces@gluster.org" target="_blank">gluster-users-bounces@gluster.org</a>> on behalf of Erik Jacobson <<a href="mailto:erik.jacobson@hpe.com" target="_blank">erik.jacobson@hpe.com</a>><br>
<b>Sent:</b> March 23, 2021 10:43 AM<br>
<b>To:</b> Diego Zuccato <<a href="mailto:diego.zuccato@unibo.it" target="_blank">diego.zuccato@unibo.it</a>><br>
<b>Cc:</b> <a href="mailto:gluster-users@gluster.org" target="_blank">gluster-users@gluster.org</a> <<a href="mailto:gluster-users@gluster.org" target="_blank">gluster-users@gluster.org</a>><br>
<b>Subject:</b> Re: [Gluster-users] Gluster usage scenarios in HPC cluster management</font>
<div> </div>
</div>
<div><font size="2"><span style="font-size:11pt">
<div>> I still have to grasp the "leader node" concept.<br>
> Weren't gluster nodes "peers"? Or by "leader" you mean that it's<br>
> mentioned in the fstab entry like<br>
> /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0<br>
> while the peer list includes l1,l2,l3 and a bunch of other nodes?<br>
<br>
Right, it's a list of 24 peers. The 24 peers are split into a 3x24<br>
replicated/distributed setup for the volumes. They also have entries<br>
for themselves as clients in /etc/fstab. I'll dump some volume info<br>
at the end of this.<br>
<br>
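(For concreteness, each leader's own client mount is an fstab line roughly like the<br>
sketch below; hostnames and options here are illustrative rather than copied from our<br>
config. backup-volfile-servers just gives the mount fallback servers to fetch the<br>
volfile from if the first one is unreachable.)<br>
<br>
leader1:/cm_shared  /mnt/cm_shared  glusterfs  defaults,_netdev,backup-volfile-servers=leader2:leader3  0 0<br>
<br>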
<br>
> > So we would have 24 leader nodes, each leader would have a disk serving<br>
> > 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,<br>
> > one is for logs, and one is heavily optimized for non-object expanded<br>
> > tree NFS). The term "disk" is loose.<br>
> That's a system way bigger than ours (3 nodes, replica3arbiter1, up to<br>
> 36 bricks per node).<br>
<br>
I have one dedicated "disk" (could be a disk, RAID LUN, or single SSD) and<br>
4 directories for volumes ("bricks"). Of course, the "ctdb" volume is just<br>
for the lock and has a single file.<br>
<br>
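(Roughly, the per-leader on-disk layout looks like the sketch below; only<br>
brick_cm_obj_sharded matches the volume info later in this mail, and the other<br>
directory names are simplified stand-ins for the ctdb/log/NFS-optimized bricks.)<br>
<br>
/data                         # one filesystem on the dedicated disk/RAID LUN/SSD<br>
/data/brick_cm_ctdb           # lock volume for CTDB (holds a single lock file)<br>
/data/brick_cm_obj_sharded    # sharded volume holding the image objects<br>
/data/brick_cm_logs           # log volume<br>
/data/brick_cm_shared         # volume tuned for expanded-tree NFS<br>
<br>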
> <br>
> > Specs of a leader node at a customer site:<br>
> > * 256G RAM<br>
> Glip! 256G for 4 bricks... No wonder I have had troubles running 26<br>
> bricks in 64GB RAM... :)<br>
<br>
I'm not an expert in memory pools or how they would be impacted by more<br>
peers. I had to do a little research, and I think what you're after is<br>
whether I can run "gluster volume status cm_shared mem" on a real cluster<br>
that has a decent node count. I will see if I can do that.<br>
<br>
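(For anyone who wants to check their own volumes, that is just the built-in status<br>
sub-command:<br>
<br>
gluster volume status cm_shared mem<br>
<br>
The per-brick Mallinfo dump further down is the same command run against the<br>
cm_obj_sharded volume.)<br>
<br>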
<br>
TEST ENV INFO for those who care<br>
--------------------------------<br>
Here is some info on my own test environment, which you can skip.<br>
<br>
I have the environment duplicated on my desktop using virtual machines and it<br>
runs fine (slow but fine). It's a 3x1. I take out my giant 8GB cache<br>
from the optimized volumes but other than that it is fine. In my<br>
development environment, the gluster disk is a 40G qcow2 image.<br>
<br>
Cache sizes changed from 8G to 100M to fit in the VM.<br>
<br>
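(That is the performance.cache-size volume option; changing it is a one-liner of this<br>
form, shown here against the cm_shared volume as an example:<br>
<br>
gluster volume set cm_shared performance.cache-size 100MB<br>
)<br>
<br>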
XML snips for memory, cpus:<br>
<domain type='kvm' id='24'><br>
<name>cm-leader1</name><br>
<uuid>99d5a8fc-a32c-b181-2f1a-2929b29c3953</uuid><br>
<memory unit='KiB'>3268608</memory><br>
<currentMemory unit='KiB'>3268608</currentMemory><br>
<vcpu placement='static'>2</vcpu><br>
<resource><br>
......<br>
<br>
<br>
I have 1 admin (head) node VM, 3 leader node VMs like the one above, and one test<br>
compute node VM for my development environment.<br>
<br>
My desktop, where I test this cluster stack, is beefy but not brand new:<br>
<br>
Architecture: x86_64<br>
CPU op-mode(s): 32-bit, 64-bit<br>
Byte Order: Little Endian<br>
Address sizes: 46 bits physical, 48 bits virtual<br>
CPU(s): 16<br>
On-line CPU(s) list: 0-15<br>
Thread(s) per core: 2<br>
Core(s) per socket: 8<br>
Socket(s): 1<br>
NUMA node(s): 1<br>
Vendor ID: GenuineIntel<br>
CPU family: 6<br>
Model: 79<br>
Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz<br>
Stepping: 1<br>
CPU MHz: 2594.333<br>
CPU max MHz: 3000.0000<br>
CPU min MHz: 1200.0000<br>
BogoMIPS: 4190.22<br>
Virtualization: VT-x<br>
L1d cache: 32K<br>
L1i cache: 32K<br>
L2 cache: 256K<br>
L3 cache: 20480K<br>
NUMA node0 CPU(s): 0-15<br>
<SNIP><br>
<br>
<br>
(Not that it matters, but this is an HP Z640 Workstation.)<br>
<br>
128G memory (good for a desktop, I know, but I think 64G would work since<br>
I also run a Windows 10 VM environment for unrelated reasons)<br>
<br>
I was able to find a MegaRAID in the lab a few years ago, so I have 4<br>
drives in a MegaRAID and carve off a separate volume for the VM disk<br>
images. It has a cache, so that's also beefier than a normal desktop.<br>
(On the other hand, I have no SSDs. I may experiment with that some day,<br>
but things work so well now that I'm tempted to leave it until something<br>
croaks :)<br>
<br>
I keep all VMs for the test cluster with "Unsafe cache mode" since there<br>
is no true data to worry about and it makes the test cases faster.<br>
<br>
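("Unsafe cache mode" is just the cache='unsafe' disk driver attribute in the libvirt<br>
domain XML; a minimal sketch of the relevant element, with a made-up image path:<br>
<br>
<disk type='file' device='disk'><br>
  <driver name='qemu' type='qcow2' cache='unsafe'/><br>
  <source file='/var/lib/libvirt/images/cm-leader1.qcow2'/><br>
  <target dev='vda' bus='virtio'/><br>
</disk><br>
)<br>
<br>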
So I am able to test a complete cluster management stack, including<br>
3 gluster leader servers, an admin node, and a compute node, all on my desktop<br>
using virtual machines and shared networks within libvirt/qemu.<br>
<br>
It is so much easier to do development when you don't have to reserve<br>
scarce test clusters and compete with people. I can do 90% of my cluster<br>
development work this way. Things fall over when I need to care about<br>
BMCs/ILOs or need to do performance testing of course. Then I move to<br>
real hardware and play the hunger-games-of-internal-test-resources :) :)<br>
<br>
I mention all this just to show that beefy servers are not needed, nor<br>
is the memory usage high. I'm not continually swapping or anything like<br>
that.<br>
<br>
<br>
<br>
<br>
Configuration Info from Real Machine<br>
------------------------------------<br>
<br>
Some info on an active 3x3 cluster. 2738 compute nodes.<br>
<br>
The most active volume here is "cm_obj_sharded". It is where the image<br>
objects live; this cluster uses image objects for compute node root<br>
filesystems. I changed the IP addresses by hand (in case I made an<br>
error doing that).<br>
<br>
<br>
Memory status for volume : cm_obj_sharded<br>
----------------------------------------------<br>
Brick : 10.1.0.5:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 20676608<br>
Ordblks : 2077<br>
Smblks : 518<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 53728<br>
Uordblks : 5223376<br>
Fordblks : 15453232<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.6:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 21409792<br>
Ordblks : 2424<br>
Smblks : 604<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 62304<br>
Uordblks : 5468096<br>
Fordblks : 15941696<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.7:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 24240128<br>
Ordblks : 2471<br>
Smblks : 563<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 58832<br>
Uordblks : 5565360<br>
Fordblks : 18674768<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.8:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 22454272<br>
Ordblks : 2575<br>
Smblks : 528<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 53920<br>
Uordblks : 5583712<br>
Fordblks : 16870560<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.9:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 22835200<br>
Ordblks : 2493<br>
Smblks : 570<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 59728<br>
Uordblks : 5424992<br>
Fordblks : 17410208<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.10:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 23085056<br>
Ordblks : 2717<br>
Smblks : 697<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 74016<br>
Uordblks : 5631520<br>
Fordblks : 17453536<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.11:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 26537984<br>
Ordblks : 3044<br>
Smblks : 985<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 103056<br>
Uordblks : 5702592<br>
Fordblks : 20835392<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.12:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 23556096<br>
Ordblks : 2658<br>
Smblks : 735<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 78720<br>
Uordblks : 5568736<br>
Fordblks : 17987360<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.13:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 26050560<br>
Ordblks : 3064<br>
Smblks : 926<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 96816<br>
Uordblks : 5807312<br>
Fordblks : 20243248<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
<br>
<br>
<br>
Volume configuration details for this one:<br>
<br>
Volume Name: cm_obj_sharded<br>
Type: Distributed-Replicate<br>
Volume ID: 76c30b65-7194-4af2-80f7-bf876f426e5a<br>
Status: Started<br>
Snapshot Count: 0<br>
Number of Bricks: 3 x 3 = 9<br>
Transport-type: tcp<br>
Bricks:<br>
Brick1: 10.1.0.5:/data/brick_cm_obj_sharded<br>
Brick2: 10.1.0.6:/data/brick_cm_obj_sharded<br>
Brick3: 10.1.0.7:/data/brick_cm_obj_sharded<br>
Brick4: 10.1.0.8:/data/brick_cm_obj_sharded<br>
Brick5: 10.1.0.9:/data/brick_cm_obj_sharded<br>
Brick6: 10.1.0.10:/data/brick_cm_obj_sharded<br>
Brick7: 10.1.0.11:/data/brick_cm_obj_sharded<br>
Brick8: 10.1.0.12:/data/brick_cm_obj_sharded<br>
Brick9: 10.1.0.13:/data/brick_cm_obj_sharded<br>
Options Reconfigured:<br>
nfs.rpc-auth-allow: 10.1.*<br>
auth.allow: 10.1.*<br>
performance.client-io-threads: on<br>
nfs.disable: off<br>
storage.fips-mode-rchecksum: on<br>
transport.address-family: inet<br>
performance.cache-size: 8GB<br>
performance.flush-behind: on<br>
performance.cache-refresh-timeout: 60<br>
performance.nfs.io-cache: on<br>
nfs.nlm: off<br>
nfs.export-volumes: on<br>
nfs.export-dirs: on<br>
nfs.exports-auth-enable: on<br>
transport.listen-backlog: 16384<br>
nfs.mount-rmtab: /-<br>
performance.io-thread-count: 32<br>
server.event-threads: 32<br>
nfs.auth-refresh-interval-sec: 360<br>
nfs.auth-cache-ttl-sec: 360<br>
features.shard: on<br>
<br>
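(For anyone mapping the "3 x 3 = 9" layout back to commands: a distributed-replicate<br>
volume like this is created by listing the nine bricks with replica 3, and gluster<br>
groups consecutive bricks into the three replica sets. A sketch, not necessarily the<br>
exact commands we ran:<br>
<br>
gluster volume create cm_obj_sharded replica 3 \<br>
  10.1.0.5:/data/brick_cm_obj_sharded 10.1.0.6:/data/brick_cm_obj_sharded 10.1.0.7:/data/brick_cm_obj_sharded \<br>
  10.1.0.8:/data/brick_cm_obj_sharded 10.1.0.9:/data/brick_cm_obj_sharded 10.1.0.10:/data/brick_cm_obj_sharded \<br>
  10.1.0.11:/data/brick_cm_obj_sharded 10.1.0.12:/data/brick_cm_obj_sharded 10.1.0.13:/data/brick_cm_obj_sharded<br>
gluster volume set cm_obj_sharded features.shard on<br>
gluster volume start cm_obj_sharded<br>
)<br>
<br>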
<br>
<br>
<br>
There are 3 other volumes (this is the only sharded one). I can provide<br>
more info if desired.<br>
<br>
Typical boot times for 3k nodes and 9 leaders, ignoring BIOS setup time,<br>
are 2-5 minutes. The power of the image objects is what makes that fast.<br>
An expanded-tree (traditional) NFS export, where the whole directory tree<br>
is exported and used file by file, would be more like 9-12 minutes.<br>
<br>
<br>
Erik<br>
</div>
</span></font></div>
</div>
</div>
________<br>
<br>
<br>
<br>
Community Meeting Calendar:<br>
<br>
Schedule -<br>
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC<br>
Bridge: <a href="https://meet.google.com/cpu-eiue-hvk" rel="noreferrer" target="_blank">https://meet.google.com/cpu-eiue-hvk</a><br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br>
<a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-users</a><br>
</blockquote></div>