<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>

</head>

<body dir="ltr">

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Erik:</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

I just want to say that I really appreciate you sharing this information with us.</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

I don't think my personal home-lab micro cluster will ever get complicated enough to need a virtualized Gluster testing/development setup like yours. On the other hand, as I mentioned before, I am running 100 Gbps InfiniBand, so what I am trying to use Gluster for is quite different from how most people deploy Gluster in production systems.</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

If I wanted to splurge, I'd get a second set of IB cables and split the high-speed interconnect, so that jobs run on one InfiniBand fabric whilst storage/Gluster traffic runs on the other.</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

But for that, I'll have to revamp my entire microcluster, so there are no plans to do that just yet.</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Thank you.</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Sincerely,</div>

<div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Ewen<br>

</div>

<div>

<div id="appendonsend"></div>

<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

<br>

</div>

<hr tabindex="-1" style="display:inline-block; width:98%">

<div id="divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> gluster-users-bounces@gluster.org &lt;gluster-users-bounces@gluster.org&gt; on behalf of Erik Jacobson &lt;erik.jacobson@hpe.com&gt;<br>

<b>Sent:</b> March 23, 2021 10:43 AM<br>

<b>To:</b> Diego Zuccato &lt;diego.zuccato@unibo.it&gt;<br>

<b>Cc:</b> gluster-users@gluster.org &lt;gluster-users@gluster.org&gt;<br>

<b>Subject:</b> Re: [Gluster-users] Gluster usage scenarios in HPC cluster management</font>

<div>&nbsp;</div>

</div>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt">

<div class="PlainText">&gt; I still have to grasp the &quot;leader node&quot; concept.<br>

&gt; Weren't gluster nodes &quot;peers&quot;? Or by &quot;leader&quot; you mean that it's<br>

&gt; mentioned in the fstab entry like<br>

&gt; /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0<br>

&gt; while the peer list includes l1,l2,l3 and a bunch of other nodes?<br>

<br>

Right, it's a list of 24 peers. The 24 peers are split into a 3x24<br>

replicated/distributed setup for the volumes. They also have entries<br>

for themselves as clients in /etc/fstab. I'll dump some volume info<br>

at the end of this.<br>

<br>

<br>

&gt; &gt; So we would have 24 leader nodes, each leader would have a disk serving<br>

&gt; &gt; 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,<br>

&gt; &gt; one is for logs, and one is heavily optimized for non-object expanded<br>

&gt; &gt; tree NFS). The term &quot;disk&quot; is loose.<br>

&gt; That's a system way bigger than ours (3 nodes, replica3arbiter1, up to<br>

&gt; 36 bricks per node).<br>

<br>

I have one dedicated &quot;disk&quot; (could be disk, raid lun, single ssd) and<br>

4 directories for volumes (&quot;bricks&quot;). Of course, the &quot;ctdb&quot; volume is just<br>

for the lock and has a single file.<br>

<br>

&gt; <br>

&gt; &gt; Specs of a leader node at a customer site:<br>

&gt; &gt;&nbsp; * 256G RAM<br>

&gt; Glip! 256G for 4 bricks... No wonder I have had troubles running 26<br>

&gt; bricks in 64GB RAM... :)<br>

<br>

I'm not an expert in memory pools or how they would be impacted by more<br>

peers. I had to do a little research and I think what you're after is<br>

if I can run gluster volume status cm_shared mem on a real cluster<br>

that has a decent node count. I will see if I can do that.<br>

<br>

<br>

TEST ENV INFO for those who care<br>

--------------------------------<br>

Here is some info on my own test environment which you can skip.<br>

<br>

I have the environment duplicated on my desktop using virtual machines and it<br>

runs fine (slow but fine). It's a 3x1. I take out my giant 8GB cache<br>

from the optimized volumes but other than that it is fine. In my<br>

development environment, the gluster disk is a 40G qcow2 image.<br>

<br>

Cache sizes changed from 8G to 100M to fit in the VM.<br>

<br>

XML snips for memory, cpus:<br>

&lt;domain type='kvm' id='24'&gt;<br>

&nbsp; &lt;name&gt;cm-leader1&lt;/name&gt;<br>

&nbsp; &lt;uuid&gt;99d5a8fc-a32c-b181-2f1a-2929b29c3953&lt;/uuid&gt;<br>

&nbsp; &lt;memory unit='KiB'&gt;3268608&lt;/memory&gt;<br>

&nbsp; &lt;currentMemory unit='KiB'&gt;3268608&lt;/currentMemory&gt;<br>

&nbsp; &lt;vcpu placement='static'&gt;2&lt;/vcpu&gt;<br>

&nbsp; &lt;resource&gt;<br>

......<br>

<br>

<br>

I have 1 admin (head) node VM, 3 VM leader nodes like above, and one test<br>

compute node for my development environment.<br>

<br>

My desktop where I test this cluster stack is a beefy but not brand new<br>

desktop:<br>

<br>

Architecture:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; x86_64<br>

CPU op-mode(s):&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 32-bit, 64-bit<br>

Byte Order:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Little Endian<br>

Address sizes:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 46 bits physical, 48 bits virtual<br>

CPU(s):&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 16<br>

On-line CPU(s) list: 0-15<br>

Thread(s) per core:&nbsp; 2<br>

Core(s) per socket:&nbsp; 8<br>

Socket(s):&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br>

NUMA node(s):&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br>

Vendor ID:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; GenuineIntel<br>

CPU family:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 6<br>

Model:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 79<br>

Model name:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz<br>

Stepping:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br>

CPU MHz:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2594.333<br>

CPU max MHz:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3000.0000<br>

CPU min MHz:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1200.0000<br>

BogoMIPS:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4190.22<br>

Virtualization:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; VT-x<br>

L1d cache:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 32K<br>

L1i cache:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 32K<br>

L2 cache:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 256K<br>

L3 cache:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 20480K<br>

NUMA node0 CPU(s):&nbsp;&nbsp; 0-15<br>

&lt;SNIP&gt;<br>

<br>

<br>

(Not that it matters but this is a HP Z640 Workstation)<br>

<br>

128G memory (good for a desktop I know, but I think 64G would work since<br>

I also run windows10 vm environment for unrelated reasons)<br>

<br>

I was able to find a MegaRAID in the lab a few years ago and so I have 4<br>

drives in a MegaRAID and carve off a separate volume for the VM disk<br>

images. It has a cache. So that's also more beefy than a normal desktop.<br>

(on the other hand, I have no SSDs. May experiment with that some day<br>

but things work so well now I'm tempted to leave it until something<br>

croaks :)<br>

<br>

I keep all VMs for the test cluster with &quot;Unsafe cache mode&quot; since there<br>

is no true data to worry about and it makes the test cases faster.<br>

<br>

So I am able to test a complete cluster management stack including<br>

3-leader-gluster servers, an admin, and compute all on my desktop using<br>

virtual machines and shared networks within libvirt/qemu.<br>

<br>

It is so much easier to do development when you don't have to reserve<br>

scarce test clusters and compete with people. I can do 90% of my cluster<br>

development work this way. Things fall over when I need to care about<br>

BMCs/ILOs or need to do performance testing of course. Then I move to<br>

real hardware and play the hunger-games-of-internal-test-resources :) :)<br>

<br>

I mention all this just to show that beefy servers are not needed and<br>

that memory usage is not high. I'm not continually swapping or anything<br>

like that.<br>

<br>

<br>

<br>

<br>

Configuration Info from Real Machine<br>

------------------------------------<br>

<br>

Some info on an active 3x3 cluster. 2738 compute nodes.<br>

<br>

The most active volume here is &quot;cm_obj_sharded&quot;. It is where the image<br>

objects live and this cluster uses image objects for compute node root<br>

filesystems. I changed the IP addresses by hand (in case I made an<br>

error doing that).<br>

<br>

<br>

Memory status for volume : cm_obj_sharded<br>

----------------------------------------------<br>

Brick : 10.1.0.5:/data/brick_cm_obj_sharded<br>

Mallinfo<br>

--------<br>

Arena&nbsp;&nbsp;&nbsp; : 20676608<br>

Ordblks&nbsp; : 2077<br>

Smblks&nbsp;&nbsp; : 518<br>

Hblks&nbsp;&nbsp;&nbsp; : 17<br>

Hblkhd&nbsp;&nbsp; : 17350656<br>

Usmblks&nbsp; : 0<br>

Fsmblks&nbsp; : 53728<br>

Uordblks : 5223376<br>

Fordblks : 15453232<br>

Keepcost : 127616<br>

<br>

----------------------------------------------<br>

Brick : 10.1.0.6:/data/brick_cm_obj_sharded<br>

Mallinfo<br>

--------<br>

Arena&nbsp;&nbsp;&nbsp; : 21409792<br>

Ordblks&nbsp; : 2424<br>

Smblks&nbsp;&nbsp; : 604<br>

Hblks&nbsp;&nbsp;&nbsp; : 17<br>

Hblkhd&nbsp;&nbsp; : 17350656<br>

Usmblks&nbsp; : 0<br>

Fsmblks&nbsp; : 62304<br>

Uordblks : 5468096<br>

Fordblks : 15941696<br>

Keepcost : 127616<br>

<br>

----------------------------------------------<br>

Brick : 10.1.0.7:/data/brick_cm_obj_sharded<br>

Mallinfo<br>

--------<br>

Arena&nbsp;&nbsp;&nbsp; : 24240128<br>

Ordblks&nbsp; : 2471<br>

Smblks&nbsp;&nbsp; : 563<br>

Hblks&nbsp;&nbsp;&nbsp; : 17<br>

Hblkhd&nbsp;&nbsp; : 17350656<br>

Usmblks&nbsp; : 0<br>

Fsmblks&nbsp; : 58832<br>

Uordblks : 5565360<br>

Fordblks : 18674768<br>

Keepcost : 127616<br>

<br>

----------------------------------------------<br>

Brick : 10.1.0.8:/data/brick_cm_obj_sharded<br>

Mallinfo<br>

--------<br>

Arena&nbsp;&nbsp;&nbsp; : 22454272<br>

Ordblks&nbsp; : 2575<br>

Smblks&nbsp;&nbsp; : 528<br>

Hblks&nbsp;&nbsp;&nbsp; : 17<br>

Hblkhd&nbsp;&nbsp; : 17350656<br>

Usmblks&nbsp; : 0<br>

Fsmblks&nbsp; : 53920<br>

Uordblks : 5583712<br>

Fordblks : 16870560<br>

Keepcost : 127616<br>

<br>

----------------------------------------------<br>

Brick : 10.1.0.9:/data/brick_cm_obj_sharded<br>

Mallinfo<br>

--------<br>

Arena&nbsp;&nbsp;&nbsp; : 22835200<br>

Ordblks&nbsp; : 2493<br>

Smblks&nbsp;&nbsp; : 570<br>

Hblks&nbsp;&nbsp;&nbsp; : 17<br>

Hblkhd&nbsp;&nbsp; : 17350656<br>

Usmblks&nbsp; : 0<br>

Fsmblks&nbsp; : 59728<br>

Uordblks : 5424992<br>

Fordblks : 17410208<br>

Keepcost : 127616<br>

<br>

----------------------------------------------<br>

Brick : 10.1.0.10:/data/brick_cm_obj_sharded<br>

Mallinfo<br>

--------<br>

Arena&nbsp;&nbsp;&nbsp; : 23085056<br>

Ordblks&nbsp; : 2717<br>

Smblks&nbsp;&nbsp; : 697<br>

Hblks&nbsp;&nbsp;&nbsp; : 17<br>

Hblkhd&nbsp;&nbsp; : 17350656<br>

Usmblks&nbsp; : 0<br>

Fsmblks&nbsp; : 74016<br>

Uordblks : 5631520<br>

Fordblks : 17453536<br>

Keepcost : 127616<br>

<br>

----------------------------------------------<br>

Brick : 10.1.0.11:/data/brick_cm_obj_sharded<br>

Mallinfo<br>

--------<br>

Arena&nbsp;&nbsp;&nbsp; : 26537984<br>

Ordblks&nbsp; : 3044<br>

Smblks&nbsp;&nbsp; : 985<br>

Hblks&nbsp;&nbsp;&nbsp; : 17<br>

Hblkhd&nbsp;&nbsp; : 17350656<br>

Usmblks&nbsp; : 0<br>

Fsmblks&nbsp; : 103056<br>

Uordblks : 5702592<br>

Fordblks : 20835392<br>

Keepcost : 127616<br>

<br>

----------------------------------------------<br>

Brick : 10.1.0.12:/data/brick_cm_obj_sharded<br>

Mallinfo<br>

--------<br>

Arena&nbsp;&nbsp;&nbsp; : 23556096<br>

Ordblks&nbsp; : 2658<br>

Smblks&nbsp;&nbsp; : 735<br>

Hblks&nbsp;&nbsp;&nbsp; : 17<br>

Hblkhd&nbsp;&nbsp; : 17350656<br>

Usmblks&nbsp; : 0<br>

Fsmblks&nbsp; : 78720<br>

Uordblks : 5568736<br>

Fordblks : 17987360<br>

Keepcost : 127616<br>

<br>

----------------------------------------------<br>

Brick : 10.1.0.13:/data/brick_cm_obj_sharded<br>

Mallinfo<br>

--------<br>

Arena&nbsp;&nbsp;&nbsp; : 26050560<br>

Ordblks&nbsp; : 3064<br>

Smblks&nbsp;&nbsp; : 926<br>

Hblks&nbsp;&nbsp;&nbsp; : 17<br>

Hblkhd&nbsp;&nbsp; : 17350656<br>

Usmblks&nbsp; : 0<br>

Fsmblks&nbsp; : 96816<br>

Uordblks : 5807312<br>

Fordblks : 20243248<br>

Keepcost : 127616<br>

<br>

----------------------------------------------<br>
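<br>
As a rough sketch of how to read the Mallinfo numbers above (an illustration<br>
added here, not gluster output): assuming the fields follow glibc mallinfo(3),<br>
Arena is heap obtained without mmap and Hblkhd is bytes in mmapped regions,<br>
so their sum approximates what the allocator holds for each brick process.<br>
These are allocator counters and only approximate what the OS reports as<br>
process memory. The helper name footprint_mib is made up for this example.<br>
<pre>
```python
# Arena and Hblkhd values (bytes) copied from the Mallinfo output above,
# keyed by brick IP. Interpretation assumed from glibc mallinfo(3):
#   arena  = non-mmapped heap obtained from the OS
#   hblkhd = bytes held in mmapped regions
MALLINFO = {
    "10.1.0.5": {"arena": 20676608, "hblkhd": 17350656},
    "10.1.0.6": {"arena": 21409792, "hblkhd": 17350656},
    "10.1.0.7": {"arena": 24240128, "hblkhd": 17350656},
    "10.1.0.8": {"arena": 22454272, "hblkhd": 17350656},
    "10.1.0.9": {"arena": 22835200, "hblkhd": 17350656},
    "10.1.0.10": {"arena": 23085056, "hblkhd": 17350656},
    "10.1.0.11": {"arena": 26537984, "hblkhd": 17350656},
    "10.1.0.12": {"arena": 23556096, "hblkhd": 17350656},
    "10.1.0.13": {"arena": 26050560, "hblkhd": 17350656},
}

MIB = 1024 * 1024


def footprint_mib(stats):
    """Approximate allocator footprint of one brick process, in MiB."""
    return (stats["arena"] + stats["hblkhd"]) / MIB


# Print one line per brick: IP and approximate footprint.
for ip, stats in MALLINFO.items():
    print(f"{ip:10s} {footprint_mib(stats):6.1f} MiB")
```
</pre>
With the values above, every brick of this volume comes out at roughly<br>
36 to 42 MiB by this measure.<br>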

<br>

<br>

<br>

Volume configuration details for this one:<br>

<br>

Volume Name: cm_obj_sharded<br>

Type: Distributed-Replicate<br>

Volume ID: 76c30b65-7194-4af2-80f7-bf876f426e5a<br>

Status: Started<br>

Snapshot Count: 0<br>

Number of Bricks: 3 x 3 = 9<br>

Transport-type: tcp<br>

Bricks:<br>

Brick1: 10.1.0.5:/data/brick_cm_obj_sharded<br>

Brick2: 10.1.0.6:/data/brick_cm_obj_sharded<br>

Brick3: 10.1.0.7:/data/brick_cm_obj_sharded<br>

Brick4: 10.1.0.8:/data/brick_cm_obj_sharded<br>

Brick5: 10.1.0.9:/data/brick_cm_obj_sharded<br>

Brick6: 10.1.0.10:/data/brick_cm_obj_sharded<br>

Brick7: 10.1.0.11:/data/brick_cm_obj_sharded<br>

Brick8: 10.1.0.12:/data/brick_cm_obj_sharded<br>

Brick9: 10.1.0.13:/data/brick_cm_obj_sharded<br>

Options Reconfigured:<br>

nfs.rpc-auth-allow: 10.1.*<br>

auth.allow: 10.1.*<br>

performance.client-io-threads: on<br>

nfs.disable: off<br>

storage.fips-mode-rchecksum: on<br>

transport.address-family: inet<br>

performance.cache-size: 8GB<br>

performance.flush-behind: on<br>

performance.cache-refresh-timeout: 60<br>

performance.nfs.io-cache: on<br>

nfs.nlm: off<br>

nfs.export-volumes: on<br>

nfs.export-dirs: on<br>

nfs.exports-auth-enable: on<br>

transport.listen-backlog: 16384<br>

nfs.mount-rmtab: /-<br>

performance.io-thread-count: 32<br>

server.event-threads: 32<br>

nfs.auth-refresh-interval-sec: 360<br>

nfs.auth-cache-ttl-sec: 360<br>

features.shard: on<br>
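<br>
To make "Number of Bricks: 3 x 3 = 9" concrete, here is a minimal sketch<br>
(an illustration, not gluster code). It assumes the usual Gluster rule that<br>
consecutive bricks in the listed order form one replica set, and that the<br>
replica count is 3 as the "3 x 3" line suggests. The function name<br>
replica_sets is made up for this example.<br>
<pre>
```python
# Brick list as shown in the volume info above (Brick1 .. Brick9).
BRICKS = [f"10.1.0.{n}:/data/brick_cm_obj_sharded" for n in range(5, 14)]

# Assumed replica count, read off "Number of Bricks: 3 x 3 = 9".
REPLICA = 3


def replica_sets(bricks, replica):
    """Group bricks into replica sets of `replica` consecutive entries."""
    if len(bricks) % replica != 0:
        raise ValueError("brick count must be a multiple of the replica count")
    return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]


# Print the hosts that make up each replica set.
for idx, group in enumerate(replica_sets(BRICKS, REPLICA), start=1):
    hosts = ", ".join(brick.split(":")[0] for brick in group)
    print(f"replica set {idx}: {hosts}")
```
</pre>
Under that assumption, files are distributed across three replica sets,<br>
and each file (or shard) is stored on all three bricks of one set.<br>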

<br>

<br>

<br>

<br>

There are 3 other volumes (this is the only sharded one). I can provide<br>

more info if desired.<br>

<br>

Typical boot time for 3k nodes and 9 leaders, ignoring BIOS setup time,<br>

is 2-5 minutes. The power of the image objects is what makes that fast.<br>

An expanded-tree (traditional) NFS export, where the whole directory tree<br>

is exported and used file by file, would be more like 9-12 minutes.<br>

<br>

<br>

Erik<br>


</div>

</span></font></div>

</div>

</body>

</html>