[Gluster-users] Gluster very poor performance when copying small files (1x (2+1) = 3, SSD)
Sam McLeod
mailinglists at smcleod.net
Tue Mar 20 00:06:53 UTC 2018
Howdy all,
Sorry, I'm in Australia, so most of your replies came in overnight for me.
Note: At the end of this reply is a listing of all our volume settings (gluster volume get <volname> all).
Note 2: I really wish Gluster used Discourse for this kind of community troubleshooting and analysis; using a mailing list is really painful.
> On 19 Mar 2018, at 4:38 pm, TomK <tomkcpr at mdevsys.com> wrote:
>
> On 3/19/2018 1:07 AM, TomK wrote:
> A few numbers you could try:
>
> performance.cache-refresh-timeout Default: 1s
I've actually set this right up to 60 (seconds). I guess it's possible that's causing an issue, but I thought it was more for forced eviction of idle files.
> cluster.stripe-block-size Default: 128KB
Hmm, yes, I wonder if it might be worth looking at stripe-block-size; I'd forgotten about it as it sounds like it's for striped volumes (now deprecated) only.
The issue with this is that I don't want to tune the volume just for small files and hurt the performance of larger I/O operations.
>
> Looks like others are having this sort of performance problem:
>
> http://lists.gluster.org/pipermail/gluster-users/2015-April/021487.html
>
> Some recommended values by one poster that might help out (https://forum.proxmox.com/threads/horribly-slow-gluster-performance.26319/) Going to try in my LAB and let you know:
>
>
> > GlusterFS 3.7 parameters:
GlusterFS 3.7 is really old, so I'd be careful looking at settings / tuning for it.
> nfs.trusted-sync: on
Not using NFS.
> performance.cache-size: 1GB
Already set to 1024MB, but that's only for reads not writes.
> performance.io-thread-count: 16
That's my current setting.
> performance.write-behind-window-size: 8MB
Currently allowing even more cache up at 256MB.
> performance.readdir-ahead: on
That's my current setting (the default now I believe).
> client.event-threads: 8
That's my current setting (the default now I believe).
> server.event-threads: 8
That's my current setting (the default now I believe).
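For reference, the current values above (also visible in the full listing at the end of this reply) would have been applied with gluster volume set, along these lines (uat_storage being our volume, adjust for yours):
# gluster volume set uat_storage performance.cache-size 1024MB
# gluster volume set uat_storage performance.io-thread-count 16
# gluster volume set uat_storage performance.write-behind-window-size 256MB
# gluster volume set uat_storage performance.readdir-ahead on
# gluster volume set uat_storage client.event-threads 8
# gluster volume set uat_storage server.event-threads 8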
> cluster.quorum-type: auto
Not sure how that's going to impact small I/O performance.
I currently have this set to none, but do use an arbiter node.
> cluster.server-quorum-type: server
Not sure how that's going to impact small I/O performance.
I currently have this set to off, but do use an arbiter node.
> cluster.server-quorum-ratio: 51%
Not sure how that's going to impact small I/O performance.
I currently have this set to 0, but do use an arbiter node.
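If anyone does want to experiment with those quorum recommendations, the commands would be something like the following (note that server-quorum-ratio appears to be cluster-wide, hence 'all'; I haven't tested these on our volumes):
# gluster volume set uat_storage cluster.quorum-type auto
# gluster volume set uat_storage cluster.server-quorum-type server
# gluster volume set all cluster.server-quorum-ratio 51%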
>
> > Kernel parameters:
> net.ipv4.tcp_slow_start_after_idle = 0
That's my current setting.
> net.ipv4.tcp_fin_timeout = 15
I've set this right down to 5.
> net.core.somaxconn = 65535
That's my current setting.
> vm.swappiness = 1
That's my current setting; we don't have swap enabled on any hosts, other than ZRAM.
> vm.dirty_ratio = 5
N/A as swap disabled (ZRAM only)
> vm.dirty_background_ratio = 2
N/A as swap disabled (ZRAM only)
> vm.min_free_kbytes = 524288 # this is on 128GB RAM
I have this set to vm.min_free_kbytes = 67584; I'd be worried that setting it that high would cause an OOM, as per the official kernel docs:
min_free_kbytes:
This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.
Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.
Setting this too high will OOM your machine instantly.
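For completeness, the relevant sysctl values on our hosts as mentioned above look roughly like this (e.g. dropped into a file under /etc/sysctl.d/ and applied with sysctl --system):
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_fin_timeout = 5
net.core.somaxconn = 65535
vm.swappiness = 1
vm.min_free_kbytes = 67584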
> On 20 Mar 2018, at 1:52 am, Rik Theys <Rik.Theys at esat.kuleuven.be> wrote:
>
> That's not really a fair comparison as you don't specify a blocksize.
> What does
>
> dd if=/dev/zero of=./some-file.bin bs=1M count=1000 oflag=direct
>
> give?
>
>
> Rik
dd is not going to give anyone particularly useful benchmarks, especially with small file sizes; in fact, it's more likely to mislead you than be useful.
See my short post on fio here: https://smcleod.net/tech/2016/04/29/benchmarking-io.html - I believe it's one of the most useful tools for I/O benchmarking.
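As a rough example, something along the lines of the following fio run (adjust the directory to wherever your gluster volume is mounted, and the size / job counts to suit) gives a far more realistic picture of small-block random write performance than dd does:
# fio --name=randwrite-4k --directory=/mnt/uat_storage --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --size=256m --numjobs=4 --iodepth=16 \
    --runtime=60 --time_based --group_reporting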
Just for a laugh I compared dd writes for 4k (small) writes between the client (gluster mounted on the cli) and a gluster host (to a directory on the same storage as the bricks).
The client came out faster - likely the direct I/O flag was not working as intended.
Client:
# dd if=/dev/zero of=./some-file.bin bs=4K count=4096 oflag=direct
4096+0 records in
4096+0 records out
16777216 bytes (17 MB) copied, 2.27839 s, 7.4 MB/s
Server:
# dd if=/dev/zero of=./some-file.bin bs=4K count=4096 oflag=direct
4096+0 records in
4096+0 records out
16777216 bytes (17 MB) copied, 3.94093 s, 4.3 MB/s
> Note: At the end of this reply is a listing of all our volume settings (gluster volume get <volname> all).
Here is an output of all gluster volume settings as they currently stand:
# gluster volume get uat_storage all
Option Value
------ -----
cluster.lookup-unhashed on
cluster.lookup-optimize true
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize true
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal on
cluster.data-self-heal on
cluster.entry-self-heal on
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock true
disperse.eager-lock on
disperse.other-eager-lock on
cluster.quorum-type none
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy size
cluster.full-lock yes
cluster.stripe-block-size 128KB
cluster.stripe-coalesce true
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level ERROR
diagnostics.client-log-level ERROR
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 6MB
performance.cache-min-file-size 0
performance.cache-refresh-timeout 60
performance.cache-priority
performance.cache-size 1024MB
performance.io-thread-count 16
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.cache-size 1024MB
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 256MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes on
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open no
performance.read-ahead-page-count 4
performance.md-cache-timeout 600
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
features.encryption off
encryption.master-key (null)
encryption.data-key-size 256
encryption.block-size 4096
network.frame-timeout 1800
network.ping-timeout 15
network.tcp-window-size (null)
features.lock-heal off
features.grace-timeout 10
network.remote-dio disable
client.event-threads 8
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 50000
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure (null)
server.root-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 256
features.lock-heal off
features.grace-timeout 10
server.ssl (null)
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 8
server.tcp-user-timeout 0
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 2048
ssl.own-cert (null)
ssl.private-key (null)
ssl.ca-list (null)
ssl.crl-path (null)
ssl.certificate-depth (null)
ssl.cipher-list (null)
ssl.dh-param (null)
ssl.ec-curve (null)
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.quick-read on
performance.open-behind on
performance.nl-cache off
performance.stat-prefetch true
performance.client-io-threads true
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache off
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
performance.cache-invalidation true
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.limit-usage (null)
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.disable on
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.bd-aio off
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 0
changelog.changelog off
changelog.changelog-dir (null)
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
cluster.write-freq-threshold 0
cluster.read-freq-threshold 0
cluster.tier-pause off
cluster.tier-promote-frequency 120
cluster.tier-demote-frequency 3600
cluster.watermark-hi 90
cluster.watermark-low 75
cluster.tier-mode cache
cluster.tier-max-promote-file-size 0
cluster.tier-max-mb 4000
cluster.tier-max-files 10000
cluster.tier-query-limit 100
cluster.tier-compact on
cluster.tier-hot-compact-frequency 604800
cluster.tier-cold-compact-frequency 604800
features.ctr-enabled off
features.record-counters off
features.ctr-record-metadata-heat off
features.ctr_link_consistency off
features.ctr_lookupheal_link_timeout 300
features.ctr_lookupheal_inode_timeout 300
features.ctr-sql-db-cachesize 12500
features.ctr-sql-db-wal-autocheckpoint 25000
features.selinux on
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.shard off
features.shard-block-size 64MB
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation true
features.cache-invalidation-timeout 600
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy round-robin
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops true
performance.parallel-readdir off
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 256MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 60
cluster.brick-multiplex off
cluster.max-bricks-per-process 0
disperse.optimistic-change-log on
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable (null)
disperse.parallel-writes on
--
Sam McLeod (protoporpoise on IRC)
https://smcleod.net
https://twitter.com/s_mcleod
Words are my own opinions and do not necessarily represent those of my employer or partners.