<div dir="ltr">Just add on, we are using gluster beside our main storage Lustre for k8s cluster . </div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 24, 2021 at 4:33 AM Ewen Chan <<a href="mailto:alpha754293@hotmail.com">alpha754293@hotmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Erik:</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I just want to say that I really appreciate you sharing this information with us.</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I don't think my personal home lab micro-cluster will ever get complicated enough to warrant a virtualized testing/Gluster development setup like yours. On the other hand, as I mentioned before, I am running 100 Gbps InfiniBand, so what
I am trying to do with Gluster is quite different from what and how most people deploy Gluster for production systems.</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
If I wanted to splurge, I'd get a second set of IB cables so that the high-speed interconnect could be split: jobs would run on one layer of the InfiniBand fabric while storage/Gluster traffic ran on another.</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
But for that, I'll have to revamp my entire microcluster, so there are no plans to do that just yet.</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Thank you.</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Sincerely,</div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Ewen<br>
</div>
<div>
<div id="gmail-m_-379120464223181042appendonsend"></div>
<div style="font-family:Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_-379120464223181042divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> <a href="mailto:gluster-users-bounces@gluster.org" target="_blank">gluster-users-bounces@gluster.org</a> <<a href="mailto:gluster-users-bounces@gluster.org" target="_blank">gluster-users-bounces@gluster.org</a>> on behalf of Erik Jacobson <<a href="mailto:erik.jacobson@hpe.com" target="_blank">erik.jacobson@hpe.com</a>><br>
<b>Sent:</b> March 23, 2021 10:43 AM<br>
<b>To:</b> Diego Zuccato <<a href="mailto:diego.zuccato@unibo.it" target="_blank">diego.zuccato@unibo.it</a>><br>
<b>Cc:</b> <a href="mailto:gluster-users@gluster.org" target="_blank">gluster-users@gluster.org</a> <<a href="mailto:gluster-users@gluster.org" target="_blank">gluster-users@gluster.org</a>><br>
<b>Subject:</b> Re: [Gluster-users] Gluster usage scenarios in HPC cluster management</font>
<div> </div>
</div>
<div><font size="2"><span style="font-size:11pt">
<div>> I still have to grasp the "leader node" concept.<br>
> Weren't gluster nodes "peers"? Or by "leader" you mean that it's<br>
> mentioned in the fstab entry like<br>
> /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0<br>
> while the peer list includes l1,l2,l3 and a bunch of other nodes?<br>
<br>
Right, it's a list of 24 peers. The 24 peers are split into a 3x24<br>
replicated/distributed setup for the volumes. They also have entries<br>
for themselves as clients in /etc/fstab. I'll dump some volume info<br>
at the end of this.<br>
<br>
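(For concreteness, each leader's own client mount is an fstab line roughly like the<br>
sketch below; hostnames and options here are illustrative rather than copied from our<br>
config. backup-volfile-servers just gives the mount fallback servers to fetch the<br>
volfile from if the first one is unreachable.)<br>
<br>
leader1:/cm_shared  /mnt/cm_shared  glusterfs  defaults,_netdev,backup-volfile-servers=leader2:leader3  0 0<br>
<br>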
<br>
> > So we would have 24 leader nodes, each leader would have a disk serving<br>
> > 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,<br>
> > one is for logs, and one is heavily optimized for non-object expanded<br>
> > tree NFS). The term "disk" is loose.<br>
> That's a system way bigger than ours (3 nodes, replica3arbiter1, up to<br>
> 36 bricks per node).<br>
<br>
I have one dedicated "disk" (could be a disk, RAID LUN, or single SSD) and<br>
4 directories for volumes ("bricks"). Of course, the "ctdb" volume is just<br>
for the lock and has a single file.<br>
<br>
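(Roughly, the per-leader on-disk layout looks like the sketch below; only<br>
brick_cm_obj_sharded matches the volume info later in this mail, and the other<br>
directory names are simplified stand-ins for the ctdb/log/NFS-optimized bricks.)<br>
<br>
/data                         # one filesystem on the dedicated disk/RAID LUN/SSD<br>
/data/brick_cm_ctdb           # lock volume for CTDB (holds a single lock file)<br>
/data/brick_cm_obj_sharded    # sharded volume holding the image objects<br>
/data/brick_cm_logs           # log volume<br>
/data/brick_cm_shared         # volume tuned for expanded-tree NFS<br>
<br>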
> <br>
> > Specs of a leader node at a customer site:<br>
> > * 256G RAM<br>
> Glip! 256G for 4 bricks... No wonder I have had troubles running 26<br>
> bricks in 64GB RAM... :)<br>
<br>
I'm not an expert in memory pools or how they would be impacted by more<br>
peers. I had to do a little research, and I think what you're after is<br>
whether I can run "gluster volume status cm_shared mem" on a real cluster<br>
that has a decent node count. I will see if I can do that.<br>
<br>
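(For anyone who wants to check their own volumes, that is just the built-in status<br>
sub-command:<br>
<br>
gluster volume status cm_shared mem<br>
<br>
The per-brick Mallinfo dump further down is the same command run against the<br>
cm_obj_sharded volume.)<br>
<br>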
<br>
TEST ENV INFO for those who care<br>
--------------------------------<br>
Here is some info on my own test environment, which you can skip.<br>
<br>
I have the environment duplicated on my desktop using virtual machines and it<br>
runs fine (slow but fine). It's a 3x1. I take out my giant 8GB cache<br>
from the optimized volumes but other than that it is fine. In my<br>
development environment, the gluster disk is a 40G qcow2 image.<br>
<br>
Cache sizes changed from 8G to 100M to fit in the VM.<br>
<br>
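(That is the performance.cache-size volume option; changing it is a one-liner of this<br>
form, shown here against the cm_shared volume as an example:<br>
<br>
gluster volume set cm_shared performance.cache-size 100MB<br>
)<br>
<br>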
XML snips for memory, cpus:<br>
<domain type='kvm' id='24'><br>
<name>cm-leader1</name><br>
<uuid>99d5a8fc-a32c-b181-2f1a-2929b29c3953</uuid><br>
<memory unit='KiB'>3268608</memory><br>
<currentMemory unit='KiB'>3268608</currentMemory><br>
<vcpu placement='static'>2</vcpu><br>
<resource><br>
......<br>
<br>
<br>
I have 1 admin (head) node VM, 3 leader node VMs like the one above, and one test<br>
compute node VM for my development environment.<br>
<br>
My desktop, where I test this cluster stack, is beefy but not brand new:<br>
<br>
Architecture: x86_64<br>
CPU op-mode(s): 32-bit, 64-bit<br>
Byte Order: Little Endian<br>
Address sizes: 46 bits physical, 48 bits virtual<br>
CPU(s): 16<br>
On-line CPU(s) list: 0-15<br>
Thread(s) per core: 2<br>
Core(s) per socket: 8<br>
Socket(s): 1<br>
NUMA node(s): 1<br>
Vendor ID: GenuineIntel<br>
CPU family: 6<br>
Model: 79<br>
Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz<br>
Stepping: 1<br>
CPU MHz: 2594.333<br>
CPU max MHz: 3000.0000<br>
CPU min MHz: 1200.0000<br>
BogoMIPS: 4190.22<br>
Virtualization: VT-x<br>
L1d cache: 32K<br>
L1i cache: 32K<br>
L2 cache: 256K<br>
L3 cache: 20480K<br>
NUMA node0 CPU(s): 0-15<br>
<SNIP><br>
<br>
<br>
(Not that it matters, but this is an HP Z640 Workstation.)<br>
<br>
128G memory (good for a desktop, I know, but I think 64G would work since<br>
I also run a Windows 10 VM environment for unrelated reasons)<br>
<br>
I was able to find a MegaRAID in the lab a few years ago, so I have 4<br>
drives in a MegaRAID and carve off a separate volume for the VM disk<br>
images. It has a cache, so that's also beefier than a normal desktop.<br>
(On the other hand, I have no SSDs. I may experiment with that some day,<br>
but things work so well now that I'm tempted to leave it until something<br>
croaks :)<br>
<br>
I keep all VMs for the test cluster with "Unsafe cache mode" since there<br>
is no true data to worry about and it makes the test cases faster.<br>
<br>
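("Unsafe cache mode" is just the cache='unsafe' disk driver attribute in the libvirt<br>
domain XML; a minimal sketch of the relevant element, with a made-up image path:<br>
<br>
<disk type='file' device='disk'><br>
  <driver name='qemu' type='qcow2' cache='unsafe'/><br>
  <source file='/var/lib/libvirt/images/cm-leader1.qcow2'/><br>
  <target dev='vda' bus='virtio'/><br>
</disk><br>
)<br>
<br>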
So I am able to test a complete cluster management stack, including<br>
3 gluster leader servers, an admin node, and a compute node, all on my desktop<br>
using virtual machines and shared networks within libvirt/qemu.<br>
<br>
It is so much easier to do development when you don't have to reserve<br>
scarce test clusters and compete with people. I can do 90% of my cluster<br>
development work this way. Things fall over when I need to care about<br>
BMCs/ILOs or need to do performance testing of course. Then I move to<br>
real hardware and play the hunger-games-of-internal-test-resources :) :)<br>
<br>
I mention all this just to show that beefy servers are not needed, nor<br>
is the memory usage high. I'm not continually swapping or anything like<br>
that.<br>
<br>
<br>
<br>
<br>
Configuration Info from Real Machine<br>
------------------------------------<br>
<br>
Some info on an active 3x3 cluster. 2738 compute nodes.<br>
<br>
The most active volume here is "cm_obj_sharded". It is where the image<br>
objects live; this cluster uses image objects for compute node root<br>
filesystems. I changed the IP addresses by hand (in case I made an<br>
error doing that).<br>
<br>
<br>
Memory status for volume : cm_obj_sharded<br>
----------------------------------------------<br>
Brick : 10.1.0.5:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 20676608<br>
Ordblks : 2077<br>
Smblks : 518<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 53728<br>
Uordblks : 5223376<br>
Fordblks : 15453232<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.6:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 21409792<br>
Ordblks : 2424<br>
Smblks : 604<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 62304<br>
Uordblks : 5468096<br>
Fordblks : 15941696<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.7:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 24240128<br>
Ordblks : 2471<br>
Smblks : 563<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 58832<br>
Uordblks : 5565360<br>
Fordblks : 18674768<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.8:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 22454272<br>
Ordblks : 2575<br>
Smblks : 528<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 53920<br>
Uordblks : 5583712<br>
Fordblks : 16870560<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.9:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 22835200<br>
Ordblks : 2493<br>
Smblks : 570<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 59728<br>
Uordblks : 5424992<br>
Fordblks : 17410208<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.10:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 23085056<br>
Ordblks : 2717<br>
Smblks : 697<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 74016<br>
Uordblks : 5631520<br>
Fordblks : 17453536<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.11:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 26537984<br>
Ordblks : 3044<br>
Smblks : 985<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 103056<br>
Uordblks : 5702592<br>
Fordblks : 20835392<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.12:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 23556096<br>
Ordblks : 2658<br>
Smblks : 735<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 78720<br>
Uordblks : 5568736<br>
Fordblks : 17987360<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
Brick : 10.1.0.13:/data/brick_cm_obj_sharded<br>
Mallinfo<br>
--------<br>
Arena : 26050560<br>
Ordblks : 3064<br>
Smblks : 926<br>
Hblks : 17<br>
Hblkhd : 17350656<br>
Usmblks : 0<br>
Fsmblks : 96816<br>
Uordblks : 5807312<br>
Fordblks : 20243248<br>
Keepcost : 127616<br>
<br>
----------------------------------------------<br>
<br>
<br>
<br>
Volume configuration details for this one:<br>
<br>
Volume Name: cm_obj_sharded<br>
Type: Distributed-Replicate<br>
Volume ID: 76c30b65-7194-4af2-80f7-bf876f426e5a<br>
Status: Started<br>
Snapshot Count: 0<br>
Number of Bricks: 3 x 3 = 9<br>
Transport-type: tcp<br>
Bricks:<br>
Brick1: 10.1.0.5:/data/brick_cm_obj_sharded<br>
Brick2: 10.1.0.6:/data/brick_cm_obj_sharded<br>
Brick3: 10.1.0.7:/data/brick_cm_obj_sharded<br>
Brick4: 10.1.0.8:/data/brick_cm_obj_sharded<br>
Brick5: 10.1.0.9:/data/brick_cm_obj_sharded<br>
Brick6: 10.1.0.10:/data/brick_cm_obj_sharded<br>
Brick7: 10.1.0.11:/data/brick_cm_obj_sharded<br>
Brick8: 10.1.0.12:/data/brick_cm_obj_sharded<br>
Brick9: 10.1.0.13:/data/brick_cm_obj_sharded<br>
Options Reconfigured:<br>
nfs.rpc-auth-allow: 10.1.*<br>
auth.allow: 10.1.*<br>
performance.client-io-threads: on<br>
nfs.disable: off<br>
storage.fips-mode-rchecksum: on<br>
transport.address-family: inet<br>
performance.cache-size: 8GB<br>
performance.flush-behind: on<br>
performance.cache-refresh-timeout: 60<br>
performance.nfs.io-cache: on<br>
nfs.nlm: off<br>
nfs.export-volumes: on<br>
nfs.export-dirs: on<br>
nfs.exports-auth-enable: on<br>
transport.listen-backlog: 16384<br>
nfs.mount-rmtab: /-<br>
performance.io-thread-count: 32<br>
server.event-threads: 32<br>
nfs.auth-refresh-interval-sec: 360<br>
nfs.auth-cache-ttl-sec: 360<br>
features.shard: on<br>
<br>
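(For anyone mapping the "3 x 3 = 9" layout back to commands: a distributed-replicate<br>
volume like this is created by listing the nine bricks with replica 3, and gluster<br>
groups consecutive bricks into the three replica sets. A sketch, not necessarily the<br>
exact commands we ran:<br>
<br>
gluster volume create cm_obj_sharded replica 3 \<br>
  10.1.0.5:/data/brick_cm_obj_sharded 10.1.0.6:/data/brick_cm_obj_sharded 10.1.0.7:/data/brick_cm_obj_sharded \<br>
  10.1.0.8:/data/brick_cm_obj_sharded 10.1.0.9:/data/brick_cm_obj_sharded 10.1.0.10:/data/brick_cm_obj_sharded \<br>
  10.1.0.11:/data/brick_cm_obj_sharded 10.1.0.12:/data/brick_cm_obj_sharded 10.1.0.13:/data/brick_cm_obj_sharded<br>
gluster volume set cm_obj_sharded features.shard on<br>
gluster volume start cm_obj_sharded<br>
)<br>
<br>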
<br>
<br>
<br>
There are 3 other volumes (this is the only sharded one). I can provide<br>
more info if desired.<br>
<br>
Typical boot times for 3k nodes and 9 leaders, ignoring BIOS setup time,<br>
are 2-5 minutes. The power of the image objects is what makes that fast.<br>
An expanded-tree (traditional) NFS export, where the whole directory tree<br>
is exported and used file by file, would be more like 9-12 minutes.<br>
<br>
<br>
Erik<br>
</div>
</span></font></div>
</div>
</div>
________<br>
<br>
<br>
<br>
Community Meeting Calendar:<br>
<br>
Schedule -<br>
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC<br>
Bridge: <a href="https://meet.google.com/cpu-eiue-hvk" rel="noreferrer" target="_blank">https://meet.google.com/cpu-eiue-hvk</a><br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br>
<a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-users</a><br>
</blockquote></div>