<div>If you don't experience any OOM, you can focus on the heals.</div><div><br></div><div>284 glfsheal processes seem odd.</div><div><br><div>Can you check the PPID of 2-3 randomly picked ones?</div><div>ps -o ppid= &lt;pid&gt;</div><div><br></div><div>Best Regards,</div><div>Strahil Nikolov <br> <br> <blockquote style="margin: 0 0 20px 0;"> <div style="font-family:Roboto, sans-serif; color:#6D00F6;"> <div>On Wed, Mar 15, 2023 at 9:54, Diego Zuccato</div><div>&lt;diego.zuccato@unibo.it&gt; wrote:</div> </div> <div style="padding: 10px 0 0 20px; margin: 10px 0 0 0; border-left: 1px solid #6D00F6;"> I enabled it yesterday and that greatly reduced memory pressure.<br clear="none">Current volume info:<br clear="none">-8<--<br clear="none">Volume Name: cluster_data<br clear="none">Type: Distributed-Replicate<br clear="none">Volume ID: a8caaa90-d161-45bb-a68c-278263a8531a<br clear="none">Status: Started<br clear="none">Snapshot Count: 0<br clear="none">Number of Bricks: 45 x (2 + 1) = 135<br clear="none">Transport-type: tcp<br clear="none">Bricks:<br clear="none">Brick1: clustor00:/srv/bricks/00/d<br clear="none">Brick2: clustor01:/srv/bricks/00/d<br clear="none">Brick3: clustor02:/srv/bricks/00/q (arbiter)<br clear="none">[...]<br clear="none">Brick133: clustor01:/srv/bricks/29/d<br clear="none">Brick134: clustor02:/srv/bricks/29/d<br clear="none">Brick135: clustor00:/srv/bricks/14/q (arbiter)<br clear="none">Options Reconfigured:<br clear="none">performance.quick-read: off<br clear="none">cluster.entry-self-heal: on<br clear="none">cluster.data-self-heal-algorithm: full<br clear="none">cluster.metadata-self-heal: on<br clear="none">cluster.shd-max-threads: 2<br clear="none">network.inode-lru-limit: 500000<br clear="none">performance.md-cache-timeout: 600<br clear="none">performance.cache-invalidation: on<br clear="none">features.cache-invalidation-timeout: 600<br clear="none">features.cache-invalidation: on<br clear="none">features.quota-deem-statfs: on<br 
clear="none">performance.readdir-ahead: on<br clear="none">cluster.granular-entry-heal: enable<br clear="none">features.scrub: Active<br clear="none">features.bitrot: on<br clear="none">cluster.lookup-optimize: on<br clear="none">performance.stat-prefetch: on<br clear="none">performance.cache-refresh-timeout: 60<br clear="none">performance.parallel-readdir: on<br clear="none">performance.write-behind-window-size: 128MB<br clear="none">cluster.self-heal-daemon: enable<br clear="none">features.inode-quota: on<br clear="none">features.quota: on<br clear="none">transport.address-family: inet<br clear="none">nfs.disable: on<br clear="none">performance.client-io-threads: off<br clear="none">client.event-threads: 1<br clear="none">features.scrub-throttle: normal<br clear="none">diagnostics.brick-log-level: ERROR<br clear="none">diagnostics.client-log-level: ERROR<br clear="none">config.brick-threads: 0<br clear="none">cluster.lookup-unhashed: on<br clear="none">config.client-threads: 1<br clear="none">cluster.use-anonymous-inode: off<br clear="none">diagnostics.brick-sys-log-level: CRITICAL<br clear="none">features.scrub-freq: monthly<br clear="none">cluster.data-self-heal: on<br clear="none">cluster.brick-multiplex: on<br clear="none">cluster.daemon-log-level: ERROR<br clear="none">-8<--<br clear="none"><br clear="none">htop reports memory usage of up to 143G, with 602 tasks and <br clear="none">5232 threads (~20 running) on clustor00, 117G/49 tasks/1565 threads on <br clear="none">clustor01 and 126G/45 tasks/1574 threads on clustor02.<br clear="none">I see quite a lot (284!) of glfsheal processes running on clustor00 (a <br clear="none">"gluster v heal cluster_data info summary" has been running on clustor02 since <br clear="none">yesterday, still with no output). 
Shouldn't there be just one per brick?<br clear="none"><br clear="none">Diego<br clear="none"><br clear="none">On 15/03/2023 08:30, Strahil Nikolov wrote:<br clear="none">> Do you use brick multiplexing?<br clear="none">> <br clear="none">> Best Regards,<br clear="none">> Strahil Nikolov<br clear="none">> <br clear="none">>     On Tue, Mar 14, 2023 at 16:44, Diego Zuccato<br clear="none">>     &lt;<a shape="rect" ymailto="mailto:diego.zuccato@unibo.it" href="mailto:diego.zuccato@unibo.it">diego.zuccato@unibo.it</a>&gt; wrote:<br clear="none">>     Hello all.<br clear="none">> <br clear="none">>     Our Gluster 9.6 cluster is showing increasing problems.<br clear="none">>     Currently it's composed of 3 servers (2x Intel Xeon 4210 [20 cores dual<br clear="none">>     thread, total 40 threads], 192GB RAM, 30x HGST HUH721212AL5200 [12TB]),<br clear="none">>     configured in replica 3 arbiter 1. Using Debian packages from Gluster<br clear="none">>     9.x latest repository.<br clear="none">> <br clear="none">>     It seems 192G of RAM is not enough to handle 30 data bricks + 15 arbiters,<br clear="none">>     and<br clear="none">>     I often had to reload glusterfsd because glusterfs processes got killed<br clear="none">>     for OOM.<br clear="none">>     On top of that, performance has been quite bad, especially when we<br clear="none">>     reached about 20M files. 
Moreover, one of the servers has had<br clear="none">>     mobo issues that resulted in memory errors that corrupted some<br clear="none">>     bricks' fs<br clear="none">>     (XFS, it required "xfs_repair -L" to fix).<br clear="none">>     Now I'm getting lots of "stale file handle" errors and other errors<br clear="none">>     (like directories that seem empty from the client but still contain<br clear="none">>     files in some bricks) and auto healing seems unable to complete.<br clear="none">> <br clear="none">>     Since I can't keep up with manually fixing all the issues, I'm<br clear="none">>     thinking about a backup+destroy+recreate strategy.<br clear="none">> <br clear="none">>     I think that if I reduce the number of bricks per server to just 5<br clear="none">>     (RAID1 of 6x12TB disks) I might resolve RAM issues - at the cost of<br clear="none">>     longer heal times in case a disk fails. Am I right, or is it useless?<br clear="none">>     Other recommendations?<br clear="none">>     Servers have space for another 6 disks. Maybe those could be used for<br clear="none">>     some SSDs to speed up access?<br clear="none">> <br clear="none">>     TIA.<br clear="none">> <br clear="none">>     -- <br clear="none">>     Diego Zuccato<br clear="none">>     DIFA - Dip. 
di Fisica e Astronomia<br clear="none">>     Servizi Informatici<br clear="none">>     Alma Mater Studiorum - Università di Bologna<br clear="none">>     V.le Berti-Pichat 6/2 - 40127 Bologna - Italy<br clear="none">>     tel.: +39 051 20 95786<br clear="none">>     ________<br clear="none">> <br clear="none">>     Community Meeting Calendar:<br clear="none">> <br clear="none">>     Schedule -<br clear="none">>     Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC<br clear="none">>     Bridge: <a shape="rect" href="https://meet.google.com/cpu-eiue-hvk" target="_blank">https://meet.google.com/cpu-eiue-hvk</a><br clear="none">>     Gluster-users mailing list<br clear="none">>     <a shape="rect" ymailto="mailto:Gluster-users@gluster.org" href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a><br clear="none">>     <a shape="rect" href="https://lists.gluster.org/mailman/listinfo/gluster-users" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-users</a><div class="yqt3860832755" id="yqtfd30755"><br clear="none">> <br clear="none"><br clear="none">-- <br clear="none">Diego Zuccato<br clear="none">DIFA - Dip. 
di Fisica e Astronomia<br clear="none">Servizi Informatici<br clear="none">Alma Mater Studiorum - Università di Bologna<br clear="none">V.le Berti-Pichat 6/2 - 40127 Bologna - Italy<br clear="none">tel.: +39 051 20 95786<br clear="none"></div> </div> </blockquote></div></div>