[Gluster-users] Blocking IO when hot tier promotion daemon runs

Tom Fite tomfite at gmail.com
Thu Jan 18 16:24:38 UTC 2018


Thanks for the info, Hari. Sorry about the bad gluster volume info; I
grabbed it from a file without realizing it was out of date. Here's the
current configuration showing the active hot tier:

[root@pod-sjc1-gluster1 ~]# gluster volume info

Volume Name: gv0
Type: Tier
Volume ID: d490a9ec-f9c8-4f10-a7f3-e1b6d3ced196
Status: Started
Snapshot Count: 13
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 2 = 2
Brick1: pod-sjc1-gluster2:/data/hot_tier/gv0
Brick2: pod-sjc1-gluster1:/data/hot_tier/gv0
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 3 x 2 = 6
Brick3: pod-sjc1-gluster1:/data/brick1/gv0
Brick4: pod-sjc1-gluster2:/data/brick1/gv0
Brick5: pod-sjc1-gluster1:/data/brick2/gv0
Brick6: pod-sjc1-gluster2:/data/brick2/gv0
Brick7: pod-sjc1-gluster1:/data/brick3/gv0
Brick8: pod-sjc1-gluster2:/data/brick3/gv0
Options Reconfigured:
performance.rda-low-wmark: 4KB
performance.rda-request-size: 128KB
storage.build-pgfid: on
cluster.watermark-low: 50
performance.readdir-ahead: off
cluster.tier-cold-compact-frequency: 86400
cluster.tier-hot-compact-frequency: 86400
features.ctr-sql-db-wal-autocheckpoint: 2500
cluster.tier-max-mb: 64000
cluster.tier-max-promote-file-size: 10485760
cluster.tier-max-files: 100000
cluster.tier-demote-frequency: 3600
server.allow-insecure: on
performance.flush-behind: on
performance.rda-cache-limit: 128MB
network.tcp-window-size: 1048576
performance.nfs.io-threads: off
performance.write-behind-window-size: 512MB
performance.nfs.write-behind-window-size: 4MB
performance.io-cache: on
performance.quick-read: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 90000
performance.cache-size: 1GB
server.event-threads: 10
client.event-threads: 10
features.barrier: disable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
cluster.lookup-optimize: on
server.outstanding-rpc-limit: 2056
performance.stat-prefetch: on
performance.cache-refresh-timeout: 60
features.ctr-enabled: on
cluster.tier-mode: cache
cluster.tier-compact: on
cluster.tier-pause: off
cluster.tier-promote-frequency: 1500
features.record-counters: on
cluster.write-freq-threshold: 2
cluster.read-freq-threshold: 5
features.ctr-sql-db-cachesize: 262144
cluster.watermark-hi: 95
auto-delete: enable

It will take some time to get the logs together; I need to strip out
potentially sensitive info first. I'll update the thread when I have them.
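
(For reference, a rough sketch of how I plan to sanitize them; the
hostname pattern and output name are placeholders, and I'm assuming the
tier log lives in the usual /var/log/glusterfs location:)

sed -e 's/pod-sjc1-gluster[0-9]/hostN/g' \
    -e 's#/data/[^ ]*#/data/PATH#g' \
    /var/log/glusterfs/gv0-tier.log > gv0-tier-sanitized.log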

Any theories as to why promotions take place only on one box and
demotions only on the other?

-Tom

On Thu, Jan 18, 2018 at 5:12 AM, Hari Gowtham <hgowtham at redhat.com> wrote:

> Hi Tom,
>
> The volume info doesn't show the hot bricks. I think you took the
> volume info output before attaching the hot tier.
> Can you send the volume info of the current setup where you see this issue?
>
> The logs you sent are from a later point in time. The issue was hit
> before the earliest entries available in those logs, so I need logs
> from an earlier time. Along with the entire tier logs, can you send
> the glusterd and brick logs too?
>
> The rest of my comments are inline.
>
> On Wed, Jan 10, 2018 at 9:03 PM, Tom Fite <tomfite at gmail.com> wrote:
> > I should add that additional testing has shown that only access to
> > files is held up; IO is not interrupted for existing transfers. I
> > think this points to the heat metadata in the tier's sqlite DB. Is it
> > possible that a table is temporarily locked while the promotion daemon
> > runs, so that the calls to update the access count on files are
> > blocked?
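> >
> > (If it helps, here's roughly how I'd poke at that theory from a brick
> > node. This is only a sketch; I'm assuming the CTR heat database sits
> > under each brick's .glusterfs directory as <volname>.db, which I
> > haven't verified:)
> >
> > sqlite3 /data/brick1/gv0/.glusterfs/gv0.db 'PRAGMA journal_mode;'
> > sqlite3 /data/brick1/gv0/.glusterfs/gv0.db 'PRAGMA wal_checkpoint;'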
> >
> >
> > On Wed, Jan 10, 2018 at 10:17 AM, Tom Fite <tomfite at gmail.com> wrote:
> >>
> >> The sizes of the files are extremely varied; there are millions of
> >> small (<1 MB) files and thousands of files larger than 1 GB.
>
> The tier use case is meant for larger files; it is not the best fit
> for smaller files. That can end up hindering the IOs.
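>
> (To get a rough count of the small files in play, something like this
> against one brick should do; the brick path is just an example, and the
> prune skips gluster's internal .glusterfs directory:)
>
> find /data/brick1/gv0 -path '*/.glusterfs' -prune -o -type f -size -1M -print | wc -l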
>
> >>
> >> Attached are the tier logs for gluster1 and gluster2. They are full of
> >> "demotion failed" messages, which is also reflected in the status:
> >>
> >> [root@pod-sjc1-gluster1 gv0]# gluster volume tier gv0 status
> >> Node                 Promoted files  Demoted files  Status       run time in h:m:s
> >> ---------            ---------       ---------      ---------    ---------
> >> localhost            25940           0              in progress  112:21:49
> >> pod-sjc1-gluster2    0               2917154        in progress  112:21:49
> >>
> >> Is it normal for promotions to happen only on one server and
> >> demotions only on the other?
>
> No, it's not normal.
>
> >>
> >> Volume info:
> >>
> >> [root@pod-sjc1-gluster1 ~]# gluster volume info
> >>
> >> Volume Name: gv0
> >> Type: Distributed-Replicate
> >> Volume ID: d490a9ec-f9c8-4f10-a7f3-e1b6d3ced196
> >> Status: Started
> >> Snapshot Count: 13
> >> Number of Bricks: 3 x 2 = 6
> >> Transport-type: tcp
> >> Bricks:
> >> Brick1: pod-sjc1-gluster1:/data/brick1/gv0
> >> Brick2: pod-sjc1-gluster2:/data/brick1/gv0
> >> Brick3: pod-sjc1-gluster1:/data/brick2/gv0
> >> Brick4: pod-sjc1-gluster2:/data/brick2/gv0
> >> Brick5: pod-sjc1-gluster1:/data/brick3/gv0
> >> Brick6: pod-sjc1-gluster2:/data/brick3/gv0
> >> Options Reconfigured:
> >> performance.cache-refresh-timeout: 60
> >> performance.stat-prefetch: on
> >> server.allow-insecure: on
> >> performance.flush-behind: on
> >> performance.rda-cache-limit: 32MB
> >> network.tcp-window-size: 1048576
> >> performance.nfs.io-threads: on
> >> performance.write-behind-window-size: 4MB
> >> performance.nfs.write-behind-window-size: 512MB
> >> performance.io-cache: on
> >> performance.quick-read: on
> >> features.cache-invalidation: on
> >> features.cache-invalidation-timeout: 600
> >> performance.cache-invalidation: on
> >> performance.md-cache-timeout: 600
> >> network.inode-lru-limit: 90000
> >> performance.cache-size: 4GB
> >> server.event-threads: 16
> >> client.event-threads: 16
> >> features.barrier: disable
> >> transport.address-family: inet
> >> nfs.disable: on
> >> performance.client-io-threads: on
> >> cluster.lookup-optimize: on
> >> server.outstanding-rpc-limit: 1024
> >> auto-delete: enable
> >>
> >>
> >> # gluster volume status
> >> Status of volume: gv0
> >> Gluster process                             TCP Port  RDMA Port  Online  Pid
> >> ------------------------------------------------------------------------------
> >> Hot Bricks:
> >> Brick pod-sjc1-gluster2:/data/hot_tier/gv0  49219     0          Y       26714
> >> Brick pod-sjc1-gluster1:/data/hot_tier/gv0  49199     0          Y       21325
> >> Cold Bricks:
> >> Brick pod-sjc1-gluster1:/data/brick1/gv0    49152     0          Y       3178
> >> Brick pod-sjc1-gluster2:/data/brick1/gv0    49152     0          Y       4818
> >> Brick pod-sjc1-gluster1:/data/brick2/gv0    49153     0          Y       3186
> >> Brick pod-sjc1-gluster2:/data/brick2/gv0    49153     0          Y       4829
> >> Brick pod-sjc1-gluster1:/data/brick3/gv0    49154     0          Y       3194
> >> Brick pod-sjc1-gluster2:/data/brick3/gv0    49154     0          Y       4840
> >> Tier Daemon on localhost                    N/A       N/A        Y       20313
> >> Self-heal Daemon on localhost               N/A       N/A        Y       32023
> >> Tier Daemon on pod-sjc1-gluster1            N/A       N/A        Y       24758
> >> Self-heal Daemon on pod-sjc1-gluster2       N/A       N/A        Y       12349
> >>
> >> Task Status of Volume gv0
> >> ------------------------------------------------------------------------------
> >> There are no active volume tasks
> >>
> >>
> >> On Tue, Jan 9, 2018 at 10:33 PM, Hari Gowtham <hgowtham at redhat.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Can you send the volume info, the volume status output, and the tier logs?
> >>> And I need to know the size of the files that are being stored.
> >>>
> >>> On Tue, Jan 9, 2018 at 9:51 PM, Tom Fite <tomfite at gmail.com> wrote:
> >>> > I've recently enabled an SSD-backed 2 TB hot tier on my 150 TB
> >>> > distributed-replicate volume (2 servers, 3 bricks per server).
> >>> >
> >>> > I'm seeing IO get blocked across all client FUSE threads for 10 to 15
> >>> > seconds while the promotion daemon runs. I see the 'glustertierpro'
> >>> > thread jump to 99% CPU usage on both boxes when these delays occur,
> >>> > and they happen every 25 minutes, matching my tier-promote-frequency
> >>> > setting of 1500 seconds.
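> >>> >
> >>> > (For what it's worth, this is roughly the loop I've been running from a
> >>> > client to see the stalls; the mount point and file name are just
> >>> > examples:)
> >>> >
> >>> > while true; do /usr/bin/time -f '%e' stat /mnt/gv0/somefile >/dev/null; sleep 1; done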
> >>> >
> >>> > I suspect this has something to do with the heat database in sqlite;
> >>> > maybe something is getting locked while it runs the query to determine
> >>> > which files to promote. My volume contains approximately 18 million
> >>> > files.
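> >>> >
> >>> > (As a stopgap I may just lengthen the promotion interval so the stalls
> >>> > happen less often; this only changes the existing setting shown below,
> >>> > it doesn't fix the underlying pause:)
> >>> >
> >>> > gluster volume set gv0 cluster.tier-promote-frequency 3600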
> >>> >
> >>> > Has anybody else seen this? I suspect that these delays will get worse
> >>> > as I add more files to my volume, which will cause significant
> >>> > problems.
> >>> >
> >>> > Here are my hot tier settings:
> >>> >
> >>> > # gluster volume get gv0 all | grep tier
> >>> > cluster.tier-pause                      off
> >>> > cluster.tier-promote-frequency          1500
> >>> > cluster.tier-demote-frequency           3600
> >>> > cluster.tier-mode                       cache
> >>> > cluster.tier-max-promote-file-size      10485760
> >>> > cluster.tier-max-mb                     64000
> >>> > cluster.tier-max-files                  100000
> >>> > cluster.tier-query-limit                100
> >>> > cluster.tier-compact                    on
> >>> > cluster.tier-hot-compact-frequency      86400
> >>> > cluster.tier-cold-compact-frequency     86400
> >>> >
> >>> > # gluster volume get gv0 all | grep threshold
> >>> > cluster.write-freq-threshold            2
> >>> > cluster.read-freq-threshold             5
> >>> >
> >>> > # gluster volume get gv0 all | grep watermark
> >>> > cluster.watermark-hi                    92
> >>> > cluster.watermark-low                   75
> >>> >
> >>> > _______________________________________________
> >>> > Gluster-users mailing list
> >>> > Gluster-users at gluster.org
> >>> > http://lists.gluster.org/mailman/listinfo/gluster-users
> >>>
> >>>
> >>>
> >>> --
> >>> Regards,
> >>> Hari Gowtham.
> >>
> >>
> >
>
>
>
> --
> Regards,
> Hari Gowtham.
>

