<div dir="ltr">Thanks for the info, Hari. Sorry about the bad gluster volume info, I grabbed that from a file not realizing it was out of date. Here's a current configuration showing the active hot tier:<div><br></div><div><div>[root@pod-sjc1-gluster1 ~]# gluster volume info</div><div> </div><div>Volume Name: gv0</div><div>Type: Tier</div><div>Volume ID: d490a9ec-f9c8-4f10-a7f3-e1b6d3ced196</div><div>Status: Started</div><div>Snapshot Count: 13</div><div>Number of Bricks: 8</div><div>Transport-type: tcp</div><div>Hot Tier :</div><div>Hot Tier Type : Replicate</div><div>Number of Bricks: 1 x 2 = 2</div><div>Brick1: pod-sjc1-gluster2:/data/hot_tier/gv0</div><div>Brick2: pod-sjc1-gluster1:/data/hot_tier/gv0</div><div>Cold Tier:</div><div>Cold Tier Type : Distributed-Replicate</div><div>Number of Bricks: 3 x 2 = 6</div><div>Brick3: pod-sjc1-gluster1:/data/brick1/gv0</div><div>Brick4: pod-sjc1-gluster2:/data/brick1/gv0</div><div>Brick5: pod-sjc1-gluster1:/data/brick2/gv0</div><div>Brick6: pod-sjc1-gluster2:/data/brick2/gv0</div><div>Brick7: pod-sjc1-gluster1:/data/brick3/gv0</div><div>Brick8: pod-sjc1-gluster2:/data/brick3/gv0</div><div>Options Reconfigured:</div><div>performance.rda-low-wmark: 4KB</div><div>performance.rda-request-size: 128KB</div><div>storage.build-pgfid: on</div><div>cluster.watermark-low: 50</div><div>performance.readdir-ahead: off</div><div>cluster.tier-cold-compact-frequency: 86400</div><div>cluster.tier-hot-compact-frequency: 86400</div><div>features.ctr-sql-db-wal-autocheckpoint: 2500</div><div>cluster.tier-max-mb: 64000</div><div>cluster.tier-max-promote-file-size: 10485760</div><div>cluster.tier-max-files: 100000</div><div>cluster.tier-demote-frequency: 3600</div><div>server.allow-insecure: on</div><div>performance.flush-behind: on</div><div>performance.rda-cache-limit: 128MB</div><div>network.tcp-window-size: 1048576</div><div>performance.nfs.io-threads: off</div><div>performance.write-behind-window-size: 512MB</div><div>performance.nfs.write-behind-window-size: 4MB</div><div>performance.io-cache: on</div><div>performance.quick-read: on</div><div>features.cache-invalidation: on</div><div>features.cache-invalidation-timeout: 600</div><div>performance.cache-invalidation: on</div><div>performance.md-cache-timeout: 600</div><div>network.inode-lru-limit: 90000</div><div>performance.cache-size: 1GB</div><div>server.event-threads: 10</div><div>client.event-threads: 10</div><div>features.barrier: disable</div><div>transport.address-family: inet</div><div>nfs.disable: on</div><div>performance.client-io-threads: on</div><div>cluster.lookup-optimize: on</div><div>server.outstanding-rpc-limit: 2056</div><div>performance.stat-prefetch: on</div><div>performance.cache-refresh-timeout: 60</div><div>features.ctr-enabled: on</div><div>cluster.tier-mode: cache</div><div>cluster.tier-compact: on</div><div>cluster.tier-pause: off</div><div>cluster.tier-promote-frequency: 1500</div><div>features.record-counters: on</div><div>cluster.write-freq-threshold: 2</div><div>cluster.read-freq-threshold: 5</div><div>features.ctr-sql-db-cachesize: 262144</div><div>cluster.watermark-hi: 95</div><div>auto-delete: enable</div></div><div><br></div><div>It will take some time to get the logs together, I need to strip out potentially sensitive info, will update with them when I have them.</div><div><br></div><div>Any theories as to why the promotions / demotions only take place on one box but not both?</div><div><br></div><div>-Tom</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, 
On Thu, Jan 18, 2018 at 5:12 AM, Hari Gowtham <hgowtham@redhat.com> wrote:

Hi Tom,

The volume info doesn't show the hot bricks; I think you took the volume
info output before attaching the hot tier. Can you send the volume info of
the current setup where you see this issue?

The logs you sent are from a later point in time; the issue was hit before
the period those logs cover, so I need logs from an earlier time. Along with
the entire tier logs, can you send the glusterd and brick logs too?
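If it helps, something along these lines should collect everything in one
shot on each node. The glusterd and brick log paths are the usual defaults;
the tier log name is a guess on my part, so adjust it to whatever your
install actually writes:

# bundle glusterd, brick, and tier daemon logs for sharing
tar czf gluster-logs-$(hostname).tar.gz \
    /var/log/glusterfs/glusterd.log \
    /var/log/glusterfs/bricks/ \
    /var/log/glusterfs/gv0-tier.log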
<br>
Rest of the comments are inline<br>
<span class=""><br>
On Wed, Jan 10, 2018 at 9:03 PM, Tom Fite <<a href="mailto:tomfite@gmail.com">tomfite@gmail.com</a>> wrote:<br>
> I should add that additional testing has shown that only accessing files is<br>
> held up, IO is not interrupted for existing transfers. I think this points<br>
> to the heat metadata in the sqlite DB for the tier, is it possible that a<br>
> table is temporarily locked while the promotion daemon runs so the calls to<br>
> update the access count on files are blocked?<br>
><br>
><br>
> On Wed, Jan 10, 2018 at 10:17 AM, Tom Fite <<a href="mailto:tomfite@gmail.com">tomfite@gmail.com</a>> wrote:<br>
>><br>
>> The sizes of the files are extremely varied, there are millions of small<br>
>> (<1 MB) files and thousands of files larger than 1 GB.<br>
<br>
</span>The tier use case is for bigger size files. not the best for files of<br>
smaller size.<br>
That can end up hindering the IOs.<br>
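If you want hard numbers on the distribution, counting straight off one
brick is usually enough. A rough sketch using the brick paths from your
volume info (it skips the .glusterfs internals so the gfid hard links aren't
double-counted):

# count small (<1 MB) and large (>1 GB) files on one cold brick
find /data/brick1/gv0 -type f -not -path '*/.glusterfs/*' -size -1M | wc -l
find /data/brick1/gv0 -type f -not -path '*/.glusterfs/*' -size +1G | wc -l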
<span class=""><br>
>><br>
>> Attached is the tier log for gluster1 and gluster2. These are full of<br>
>> "demotion failed" messages, which is also shown in the status:<br>
>><br>
>> [root@pod-sjc1-gluster1 gv0]# gluster volume tier gv0 status<br>
>> Node Promoted files Demoted files Status<br>
>> run time in h:m:s<br>
>> --------- --------- --------- ---------<br>
>> ---------<br>
>> localhost 25940 0 in progress<br>
>> 112:21:49<br>
>> pod-sjc1-gluster2 0 2917154 in progress<br>
>> 112:21:49<br>
>><br>
>> Is it normal to have promotions and demotions only happen on each server<br>
>> but not both?<br>
<br>
</span>No. its not normal.<br>
<div class="HOEnZb"><div class="h5"><br>
>><br>
>> Volume info:<br>
>><br>
>> [root@pod-sjc1-gluster1 ~]# gluster volume info<br>
>><br>
>> Volume Name: gv0<br>
>> Type: Distributed-Replicate<br>
>> Volume ID: d490a9ec-f9c8-4f10-a7f3-<wbr>e1b6d3ced196<br>
>> Status: Started<br>
>> Snapshot Count: 13<br>
>> Number of Bricks: 3 x 2 = 6<br>
>> Transport-type: tcp<br>
>> Bricks:<br>
>> Brick1: pod-sjc1-gluster1:/data/<wbr>brick1/gv0<br>
>> Brick2: pod-sjc1-gluster2:/data/<wbr>brick1/gv0<br>
>> Brick3: pod-sjc1-gluster1:/data/<wbr>brick2/gv0<br>
>> Brick4: pod-sjc1-gluster2:/data/<wbr>brick2/gv0<br>
>> Brick5: pod-sjc1-gluster1:/data/<wbr>brick3/gv0<br>
>> Brick6: pod-sjc1-gluster2:/data/<wbr>brick3/gv0<br>
>> Options Reconfigured:<br>
>> performance.cache-refresh-<wbr>timeout: 60<br>
>> performance.stat-prefetch: on<br>
>> server.allow-insecure: on<br>
>> performance.flush-behind: on<br>
>> performance.rda-cache-limit: 32MB<br>
>> network.tcp-window-size: 1048576<br>
>> performance.nfs.io-threads: on<br>
>> performance.write-behind-<wbr>window-size: 4MB<br>
>> performance.nfs.write-behind-<wbr>window-size: 512MB<br>
>> performance.io-cache: on<br>
>> performance.quick-read: on<br>
>> features.cache-invalidation: on<br>
>> features.cache-invalidation-<wbr>timeout: 600<br>
>> performance.cache-<wbr>invalidation: on<br>
>> performance.md-cache-timeout: 600<br>
>> network.inode-lru-limit: 90000<br>
>> performance.cache-size: 4GB<br>
>> server.event-threads: 16<br>
>> client.event-threads: 16<br>
>> features.barrier: disable<br>
>> transport.address-family: inet<br>
>> nfs.disable: on<br>
>> performance.client-io-threads: on<br>
>> cluster.lookup-optimize: on<br>
>> server.outstanding-rpc-limit: 1024<br>
>> auto-delete: enable<br>
>><br>
>><br>
>> # gluster volume status<br>
>> Status of volume: gv0<br>
>> Gluster process TCP Port RDMA Port Online<br>
>> Pid<br>
>><br>
>> ------------------------------<wbr>------------------------------<wbr>------------------<br>
>> Hot Bricks:<br>
>> Brick pod-sjc1-gluster2:/data/<br>
>> hot_tier/gv0 49219 0 Y<br>
>> 26714<br>
>> Brick pod-sjc1-gluster1:/data/<br>
>> hot_tier/gv0 49199 0 Y<br>
>> 21325<br>
>> Cold Bricks:<br>
>> Brick pod-sjc1-gluster1:/data/<br>
>> brick1/gv0 49152 0 Y<br>
>> 3178<br>
>> Brick pod-sjc1-gluster2:/data/<br>
>> brick1/gv0 49152 0 Y<br>
>> 4818<br>
>> Brick pod-sjc1-gluster1:/data/<br>
>> brick2/gv0 49153 0 Y<br>
>> 3186<br>
>> Brick pod-sjc1-gluster2:/data/<br>
>> brick2/gv0 49153 0 Y<br>
>> 4829<br>
>> Brick pod-sjc1-gluster1:/data/<br>
>> brick3/gv0 49154 0 Y<br>
>> 3194<br>
>> Brick pod-sjc1-gluster2:/data/<br>
>> brick3/gv0 49154 0 Y<br>
>> 4840<br>
>> Tier Daemon on localhost N/A N/A Y<br>
>> 20313<br>
>> Self-heal Daemon on localhost N/A N/A Y<br>
>> 32023<br>
>> Tier Daemon on pod-sjc1-gluster1 N/A N/A Y<br>
>> 24758<br>
>> Self-heal Daemon on pod-sjc1-gluster2 N/A N/A Y<br>
>> 12349<br>
>><br>
>> Task Status of Volume gv0<br>
>><br>
>> ------------------------------<wbr>------------------------------<wbr>------------------<br>
>> There are no active volume tasks<br>
>><br>
>><br>
>> On Tue, Jan 9, 2018 at 10:33 PM, Hari Gowtham <<a href="mailto:hgowtham@redhat.com">hgowtham@redhat.com</a>> wrote:<br>
>>><br>
>>> Hi,<br>
>>><br>
>>> Can you send the volume info, and volume status output and the tier logs.<br>
>>> And I need to know the size of the files that are being stored.<br>
>>><br>
>>> On Tue, Jan 9, 2018 at 9:51 PM, Tom Fite <<a href="mailto:tomfite@gmail.com">tomfite@gmail.com</a>> wrote:<br>
>>> > I've recently enabled an SSD backed 2 TB hot tier on my 150 TB 2 server<br>
>>> > / 3<br>
>>> > bricks per server distributed replicated volume.<br>
>>> ><br>
>>> > I'm seeing IO get blocked across all client FUSE threads for 10 to 15<br>
>>> > seconds while the promotion daemon runs. I see the 'glustertierpro'<br>
>>> > thread<br>
>>> > jump to 99% CPU usage on both boxes when these delays occur and they<br>
>>> > happen<br>
>>> > every 25 minutes (my tier-promote-frequency setting).<br>
>>> ><br>
>>> > I suspect this has something to do with the heat database in sqlite,<br>
>>> > maybe<br>
>>> > something is getting locked while it runs the query to determine files<br>
>>> > to<br>
>>> > promote. My volume contains approximately 18 million files.<br>
>>> ><br>
>>> > Has anybody else seen this? I suspect that these delays will get worse<br>
>>> > as I<br>
>>> > add more files to my volume which will cause significant problems.<br>
>>> ><br>
>>> > Here are my hot tier settings:<br>
>>> ><br>
>>> > # gluster volume get gv0 all | grep tier<br>
>>> > cluster.tier-pause off<br>
>>> > cluster.tier-promote-frequency 1500<br>
>>> > cluster.tier-demote-frequency 3600<br>
>>> > cluster.tier-mode cache<br>
>>> > cluster.tier-max-promote-file-<wbr>size 10485760<br>
>>> > cluster.tier-max-mb 64000<br>
>>> > cluster.tier-max-files 100000<br>
>>> > cluster.tier-query-limit 100<br>
>>> > cluster.tier-compact on<br>
>>> > cluster.tier-hot-compact-<wbr>frequency 86400<br>
>>> > cluster.tier-cold-compact-<wbr>frequency 86400<br>
>>> ><br>
>>> > # gluster volume get gv0 all | grep threshold<br>
>>> > cluster.write-freq-threshold 2<br>
>>> > cluster.read-freq-threshold 5<br>
>>> ><br>
>>> > # gluster volume get gv0 all | grep watermark<br>
>>> > cluster.watermark-hi 92<br>
>>> > cluster.watermark-low 75<br>
>>> ><br>
>>> > ______________________________<wbr>_________________<br>
>>> > Gluster-users mailing list<br>
>>> > <a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a><br>
>>> > <a href="http://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">http://lists.gluster.org/<wbr>mailman/listinfo/gluster-users</a><br>
>>><br>
>>><br>
>>><br>
>>> --<br>
>>> Regards,<br>
>>> Hari Gowtham.<br>
>><br>
>><br>
><br>
<br>
<br>
<br>
</div></div><span class="HOEnZb"><font color="#888888">--<br>
Regards,<br>
Hari Gowtham.<br>
</font></span></blockquote></div><br></div>