<div dir="ltr">Thanks for the info, Hari. Sorry about the bad gluster volume info, I grabbed that from a file not realizing it was out of date. Here's a current configuration showing the active hot tier:<div><br></div><div><div>[root@pod-sjc1-gluster1 ~]# gluster volume info</div><div> </div><div>Volume Name: gv0</div><div>Type: Tier</div><div>Volume ID: d490a9ec-f9c8-4f10-a7f3-e1b6d3ced196</div><div>Status: Started</div><div>Snapshot Count: 13</div><div>Number of Bricks: 8</div><div>Transport-type: tcp</div><div>Hot Tier :</div><div>Hot Tier Type : Replicate</div><div>Number of Bricks: 1 x 2 = 2</div><div>Brick1: pod-sjc1-gluster2:/data/hot_tier/gv0</div><div>Brick2: pod-sjc1-gluster1:/data/hot_tier/gv0</div><div>Cold Tier:</div><div>Cold Tier Type : Distributed-Replicate</div><div>Number of Bricks: 3 x 2 = 6</div><div>Brick3: pod-sjc1-gluster1:/data/brick1/gv0</div><div>Brick4: pod-sjc1-gluster2:/data/brick1/gv0</div><div>Brick5: pod-sjc1-gluster1:/data/brick2/gv0</div><div>Brick6: pod-sjc1-gluster2:/data/brick2/gv0</div><div>Brick7: pod-sjc1-gluster1:/data/brick3/gv0</div><div>Brick8: pod-sjc1-gluster2:/data/brick3/gv0</div><div>Options Reconfigured:</div><div>performance.rda-low-wmark: 4KB</div><div>performance.rda-request-size: 128KB</div><div>storage.build-pgfid: on</div><div>cluster.watermark-low: 50</div><div>performance.readdir-ahead: off</div><div>cluster.tier-cold-compact-frequency: 86400</div><div>cluster.tier-hot-compact-frequency: 86400</div><div>features.ctr-sql-db-wal-autocheckpoint: 2500</div><div>cluster.tier-max-mb: 64000</div><div>cluster.tier-max-promote-file-size: 10485760</div><div>cluster.tier-max-files: 100000</div><div>cluster.tier-demote-frequency: 3600</div><div>server.allow-insecure: on</div><div>performance.flush-behind: on</div><div>performance.rda-cache-limit: 128MB</div><div>network.tcp-window-size: 1048576</div><div>performance.nfs.io-threads: off</div><div>performance.write-behind-window-size: 512MB</div><div>performance.nfs.write-behind-window-size: 4MB</div><div>performance.io-cache: on</div><div>performance.quick-read: on</div><div>features.cache-invalidation: on</div><div>features.cache-invalidation-timeout: 600</div><div>performance.cache-invalidation: on</div><div>performance.md-cache-timeout: 600</div><div>network.inode-lru-limit: 90000</div><div>performance.cache-size: 1GB</div><div>server.event-threads: 10</div><div>client.event-threads: 10</div><div>features.barrier: disable</div><div>transport.address-family: inet</div><div>nfs.disable: on</div><div>performance.client-io-threads: on</div><div>cluster.lookup-optimize: on</div><div>server.outstanding-rpc-limit: 2056</div><div>performance.stat-prefetch: on</div><div>performance.cache-refresh-timeout: 60</div><div>features.ctr-enabled: on</div><div>cluster.tier-mode: cache</div><div>cluster.tier-compact: on</div><div>cluster.tier-pause: off</div><div>cluster.tier-promote-frequency: 1500</div><div>features.record-counters: on</div><div>cluster.write-freq-threshold: 2</div><div>cluster.read-freq-threshold: 5</div><div>features.ctr-sql-db-cachesize: 262144</div><div>cluster.watermark-hi: 95</div><div>auto-delete: enable</div></div><div><br></div><div>It will take some time to get the logs together, I need to strip out potentially sensitive info, will update with them when I have them.</div><div><br></div><div>Any theories as to why the promotions / demotions only take place on one box but not both?</div><div><br></div><div>-Tom</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, 
On Thu, Jan 18, 2018 at 5:12 AM, Hari Gowtham <hgowtham@redhat.com> wrote:

Hi Tom,

The volume info doesn't show the hot bricks; I think you took the volume
info output before attaching the hot tier. Can you send the volume info of
the current setup where you see this issue?

The logs you sent are from a later point in time; the issue was hit before
the period those logs cover, so I need logs from an earlier time. Along with
the entire tier logs, can you send the glusterd and brick logs too?
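If it helps, something along these lines should collect everything in one
shot on each node. The glusterd and brick log paths are the usual defaults;
the tier log name is a guess on my part, so adjust it to whatever your
install actually writes:

# bundle glusterd, brick, and tier daemon logs for sharing
tar czf gluster-logs-$(hostname).tar.gz \
    /var/log/glusterfs/glusterd.log \
    /var/log/glusterfs/bricks/ \
    /var/log/glusterfs/gv0-tier.log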
<br>
Rest of the comments are inline<br>
<span class=""><br>
On Wed, Jan 10, 2018 at 9:03 PM, Tom Fite <<a href="mailto:tomfite@gmail.com">tomfite@gmail.com</a>> wrote:<br>
> I should add that additional testing has shown that only accessing files is<br>
> held up, IO is not interrupted for existing transfers. I think this points<br>
> to the heat metadata in the sqlite DB for the tier, is it possible that a<br>
> table is temporarily locked while the promotion daemon runs so the calls to<br>
> update the access count on files are blocked?<br>
><br>
><br>
> On Wed, Jan 10, 2018 at 10:17 AM, Tom Fite <<a href="mailto:tomfite@gmail.com">tomfite@gmail.com</a>> wrote:<br>
>><br>
>> The sizes of the files are extremely varied, there are millions of small<br>
>> (<1 MB) files and thousands of files larger than 1 GB.<br>
<br>
</span>The tier use case is for bigger size files. not the best for files of<br>
smaller size.<br>
That can end up hindering the IOs.<br>
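If you want hard numbers on the distribution, counting straight off one
brick is usually enough. A rough sketch using the brick paths from your
volume info (it skips the .glusterfs internals so the gfid hard links aren't
double-counted):

# count small (<1 MB) and large (>1 GB) files on one cold brick
find /data/brick1/gv0 -type f -not -path '*/.glusterfs/*' -size -1M | wc -l
find /data/brick1/gv0 -type f -not -path '*/.glusterfs/*' -size +1G | wc -l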
<span class=""><br>
>><br>
>> Attached is the tier log for gluster1 and gluster2. These are full of<br>
>> "demotion failed" messages, which is also shown in the status:<br>
>><br>
>> [root@pod-sjc1-gluster1 gv0]# gluster volume tier gv0 status<br>
>> Node Promoted files Demoted files Status<br>
>> run time in h:m:s<br>
>> --------- --------- --------- ---------<br>
>> ---------<br>
>> localhost 25940 0 in progress<br>
>> 112:21:49<br>
>> pod-sjc1-gluster2 0 2917154 in progress<br>
>> 112:21:49<br>
>><br>
>> Is it normal to have promotions and demotions only happen on each server<br>
>> but not both?<br>
<br>
</span>No. its not normal.<br>
<div class="HOEnZb"><div class="h5"><br>
>><br>
>> Volume info:<br>
>><br>
>> [root@pod-sjc1-gluster1 ~]# gluster volume info<br>
>><br>
>> Volume Name: gv0<br>
>> Type: Distributed-Replicate<br>
>> Volume ID: d490a9ec-f9c8-4f10-a7f3-<wbr>e1b6d3ced196<br>
>> Status: Started<br>
>> Snapshot Count: 13<br>
>> Number of Bricks: 3 x 2 = 6<br>
>> Transport-type: tcp<br>
>> Bricks:<br>
>> Brick1: pod-sjc1-gluster1:/data/<wbr>brick1/gv0<br>
>> Brick2: pod-sjc1-gluster2:/data/<wbr>brick1/gv0<br>
>> Brick3: pod-sjc1-gluster1:/data/<wbr>brick2/gv0<br>
>> Brick4: pod-sjc1-gluster2:/data/<wbr>brick2/gv0<br>
>> Brick5: pod-sjc1-gluster1:/data/<wbr>brick3/gv0<br>
>> Brick6: pod-sjc1-gluster2:/data/<wbr>brick3/gv0<br>
>> Options Reconfigured:<br>
>> performance.cache-refresh-<wbr>timeout: 60<br>
>> performance.stat-prefetch: on<br>
>> server.allow-insecure: on<br>
>> performance.flush-behind: on<br>
>> performance.rda-cache-limit: 32MB<br>
>> network.tcp-window-size: 1048576<br>
>> performance.nfs.io-threads: on<br>
>> performance.write-behind-<wbr>window-size: 4MB<br>
>> performance.nfs.write-behind-<wbr>window-size: 512MB<br>
>> performance.io-cache: on<br>
>> performance.quick-read: on<br>
>> features.cache-invalidation: on<br>
>> features.cache-invalidation-<wbr>timeout: 600<br>
>> performance.cache-<wbr>invalidation: on<br>
>> performance.md-cache-timeout: 600<br>
>> network.inode-lru-limit: 90000<br>
>> performance.cache-size: 4GB<br>
>> server.event-threads: 16<br>
>> client.event-threads: 16<br>
>> features.barrier: disable<br>
>> transport.address-family: inet<br>
>> nfs.disable: on<br>
>> performance.client-io-threads: on<br>
>> cluster.lookup-optimize: on<br>
>> server.outstanding-rpc-limit: 1024<br>
>> auto-delete: enable<br>
>><br>
>><br>
>> # gluster volume status<br>
>> Status of volume: gv0<br>
>> Gluster process TCP Port RDMA Port Online<br>
>> Pid<br>
>><br>
>> ------------------------------<wbr>------------------------------<wbr>------------------<br>
>> Hot Bricks:<br>
>> Brick pod-sjc1-gluster2:/data/<br>
>> hot_tier/gv0 49219 0 Y<br>
>> 26714<br>
>> Brick pod-sjc1-gluster1:/data/<br>
>> hot_tier/gv0 49199 0 Y<br>
>> 21325<br>
>> Cold Bricks:<br>
>> Brick pod-sjc1-gluster1:/data/<br>
>> brick1/gv0 49152 0 Y<br>
>> 3178<br>
>> Brick pod-sjc1-gluster2:/data/<br>
>> brick1/gv0 49152 0 Y<br>
>> 4818<br>
>> Brick pod-sjc1-gluster1:/data/<br>
>> brick2/gv0 49153 0 Y<br>
>> 3186<br>
>> Brick pod-sjc1-gluster2:/data/<br>
>> brick2/gv0 49153 0 Y<br>
>> 4829<br>
>> Brick pod-sjc1-gluster1:/data/<br>
>> brick3/gv0 49154 0 Y<br>
>> 3194<br>
>> Brick pod-sjc1-gluster2:/data/<br>
>> brick3/gv0 49154 0 Y<br>
>> 4840<br>
>> Tier Daemon on localhost N/A N/A Y<br>
>> 20313<br>
>> Self-heal Daemon on localhost N/A N/A Y<br>
>> 32023<br>
>> Tier Daemon on pod-sjc1-gluster1 N/A N/A Y<br>
>> 24758<br>
>> Self-heal Daemon on pod-sjc1-gluster2 N/A N/A Y<br>
>> 12349<br>
>><br>
>> Task Status of Volume gv0<br>
>><br>
>> ------------------------------<wbr>------------------------------<wbr>------------------<br>
>> There are no active volume tasks<br>
>><br>
>><br>
>> On Tue, Jan 9, 2018 at 10:33 PM, Hari Gowtham <<a href="mailto:hgowtham@redhat.com">hgowtham@redhat.com</a>> wrote:<br>
>>><br>
>>> Hi,<br>
>>><br>
>>> Can you send the volume info, and volume status output and the tier logs.<br>
>>> And I need to know the size of the files that are being stored.<br>
>>><br>
>>> On Tue, Jan 9, 2018 at 9:51 PM, Tom Fite <<a href="mailto:tomfite@gmail.com">tomfite@gmail.com</a>> wrote:<br>
>>> > I've recently enabled an SSD backed 2 TB hot tier on my 150 TB 2 server<br>
>>> > / 3<br>
>>> > bricks per server distributed replicated volume.<br>
>>> ><br>
>>> > I'm seeing IO get blocked across all client FUSE threads for 10 to 15<br>
>>> > seconds while the promotion daemon runs. I see the 'glustertierpro'<br>
>>> > thread<br>
>>> > jump to 99% CPU usage on both boxes when these delays occur and they<br>
>>> > happen<br>
>>> > every 25 minutes (my tier-promote-frequency setting).<br>
>>> ><br>
>>> > I suspect this has something to do with the heat database in sqlite,<br>
>>> > maybe<br>
>>> > something is getting locked while it runs the query to determine files<br>
>>> > to<br>
>>> > promote. My volume contains approximately 18 million files.<br>
>>> ><br>
>>> > Has anybody else seen this? I suspect that these delays will get worse<br>
>>> > as I<br>
>>> > add more files to my volume which will cause significant problems.<br>
>>> ><br>
>>> > Here are my hot tier settings:<br>
>>> ><br>
>>> > # gluster volume get gv0 all | grep tier<br>
>>> > cluster.tier-pause off<br>
>>> > cluster.tier-promote-frequency 1500<br>
>>> > cluster.tier-demote-frequency 3600<br>
>>> > cluster.tier-mode cache<br>
>>> > cluster.tier-max-promote-file-<wbr>size 10485760<br>
>>> > cluster.tier-max-mb 64000<br>
>>> > cluster.tier-max-files 100000<br>
>>> > cluster.tier-query-limit 100<br>
>>> > cluster.tier-compact on<br>
>>> > cluster.tier-hot-compact-<wbr>frequency 86400<br>
>>> > cluster.tier-cold-compact-<wbr>frequency 86400<br>
>>> ><br>
>>> > # gluster volume get gv0 all | grep threshold<br>
>>> > cluster.write-freq-threshold 2<br>
>>> > cluster.read-freq-threshold 5<br>
>>> ><br>
>>> > # gluster volume get gv0 all | grep watermark<br>
>>> > cluster.watermark-hi 92<br>
>>> > cluster.watermark-low 75<br>
>>> ><br>
>>> > ______________________________<wbr>_________________<br>
>>> > Gluster-users mailing list<br>
>>> > <a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a><br>
>>> > <a href="http://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">http://lists.gluster.org/<wbr>mailman/listinfo/gluster-users</a><br>
>>><br>
>>><br>
>>><br>
>>> --<br>
>>> Regards,<br>
>>> Hari Gowtham.<br>
>><br>
>><br>
><br>
<br>
<br>
<br>
</div></div><span class="HOEnZb"><font color="#888888">--<br>
Regards,<br>
Hari Gowtham.<br>
</font></span></blockquote></div><br></div>