[Gluster-users] Gluster very poor performance when copying small files (1x (2+1) = 3, SSD)

Tue Mar 20 00:06:53 UTC 2018

Howdy all,

Sorry, I'm in Australia, so most of your replies came in overnight for me.

Note: At the end of this reply is a listing of all our volume settings (gluster volume get volname all).
Note 2: I really wish Gluster used Discourse for this kind of community troubleshooting and analysis; using a mailing list is really painful.

> On 19 Mar 2018, at 4:38 pm, TomK <tomkcpr at mdevsys.com> wrote:
> On 3/19/2018 1:07 AM, TomK wrote:
> A few numbers you could try:
> performance.cache-refresh-timeout	Default: 1s

I've actually set this right up to 60 (seconds). I guess it's possible that's causing an issue, but I thought it was more about forced eviction of idle files.

> cluster.stripe-block-size		Default: 128KB

Hmm yes, I wonder if it might be worth looking at stripe-block-size. I'd forgotten about it, as it sounds like it only applies to striped volumes (now deprecated).
The issue is that I don't want to tune the volume just for small files and hurt the performance of larger I/O operations.

> Looks like others are having this sort of performance problem:
> http://lists.gluster.org/pipermail/gluster-users/2015-April/021487.html
> Some recommended values by one poster that might help out (https://forum.proxmox.com/threads/horribly-slow-gluster-performance.26319/)  Going to try in my LAB and let you know:
> > GlusterFS 3.7 parameters:

GlusterFS 3.7 is really old so I'd be careful looking at settings / tuning for it.

> nfs.trusted-sync: on

Not using NFS.

> performance.cache-size: 1GB

Already set to 1024MB, but that's only for reads not writes.

> performance.io-thread-count: 16

That's my current setting.

> performance.write-behind-window-size: 8MB

I'm currently allowing an even larger window, at 256MB.

> performance.readdir-ahead: on

That's my current setting (the default now I believe).

> client.event-threads: 8

That's my current setting (the default now I believe).

> server.event-threads: 8

That's my current setting (the default now I believe).

> cluster.quorum-type: auto

Not sure how that's going to impact small I/O performance.
I currently have this set to none, but do use an arbiter node.

> cluster.server-quorum-type: server

Not sure how that's going to impact small I/O performance.
I currently have this set to off, but do use an arbiter node.

> cluster.server-quorum-ratio: 51%

Not sure how that's going to impact small I/O performance.
I currently have this set to 0, but do use an arbiter node.

> > Kernel parameters:
> net.ipv4.tcp_slow_start_after_idle = 0

That's my current setting.

> net.ipv4.tcp_fin_timeout = 15

I've set this right down to 5.

> net.core.somaxconn = 65535

That's my current setting.

> vm.swappiness = 1

That's my current setting; we don't have swap (other than ZRAM) enabled on any hosts.

> vm.dirty_ratio = 5

N/A as swap disabled (ZRAM only)

> vm.dirty_background_ratio = 2

N/A as swap disabled (ZRAM only)

> vm.min_free_kbytes = 524288 			# this is on 128GB RAM

I have this set to vm.min_free_kbytes = 67584. I'd be worried that setting it that high would cause OOMs, as per the official kernel docs:


This is used to force the Linux VM to keep a minimum number
of kilobytes free.  The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.
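
To put the suggested value in context, it can be expressed as a share of total RAM. This is only a rough sketch: min_free_pct is a made-up helper name, and it assumes the "128GB" in the quoted comment means 128 GiB.

```shell
#!/bin/sh
# Express vm.min_free_kbytes as a percentage of total RAM.
# min_free_pct is a hypothetical helper, not part of any tool.
# Usage: min_free_pct <min_free_kbytes> <total_ram_kbytes>
min_free_pct() {
    LC_ALL=C awk -v m="$1" -v t="$2" 'BEGIN { printf "%.2f\n", m / t * 100 }'
}

# Suggested value from the quoted post, assuming "128GB" means 128 GiB.
min_free_pct 524288 $((128 * 1024 * 1024))

# On a live Linux host the real numbers can be read from /proc:
#   min_free_pct "$(cat /proc/sys/vm/min_free_kbytes)" \
#                "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
```

Under that assumption, 524288 works out to roughly 0.39% of RAM. The same absolute value on a much smaller host would reserve a far larger share, which is the case the kernel docs warn about.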

> On 20 Mar 2018, at 1:52 am, Rik Theys <Rik.Theys at esat.kuleuven.be> wrote:
> That's not really a fair comparison as you don't specify a blocksize.
> What does
> dd if=/dev/zero of=./some-file.bin bs=1M count=1000 oflag=direct
> give?
> Rik

dd is not going to give anyone particularly useful benchmarks, especially with small file sizes; in fact, it's more likely to mislead you than be useful.
See my short post on fio, which I believe is one of the most useful tools for I/O benchmarking: https://smcleod.net/tech/2016/04/29/benchmarking-io.html
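
As a rough illustration of a small-write test with fio, something like the job below could be a starting point. The helper name small_write_job, the mount path /mnt/uat_storage and the job sizes are all assumptions for the sake of the example, not settings taken from this thread.

```shell
#!/bin/sh
# Print an fio job file for small (4k) random writes.
# small_write_job is a hypothetical helper; the directory is a placeholder.
# Usage: small_write_job <directory>
small_write_job() {
    cat <<EOF
[small-randwrite]
directory=$1
rw=randwrite
bs=4k
size=16m
numjobs=4
ioengine=libaio
direct=1
group_reporting
EOF
}

# Example (needs fio installed and a real mount point):
#   small_write_job /mnt/uat_storage > small-randwrite.fio
#   fio small-randwrite.fio
```

Running the same job against the client mount and against a directory on the brick storage should give latency and IOPS figures that are easier to compare than a single dd run.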

Just for a laugh, I compared 4k (small) dd writes between the client (gluster mounted on the client) and a gluster host (to a directory on the same storage as the bricks).
The client came out faster, most likely because the direct I/O flag was not working as intended.


# dd if=/dev/zero of=./some-file.bin bs=4K count=4096 oflag=direct
4096+0 records in
4096+0 records out
16777216 bytes (17 MB) copied, 2.27839 s, 7.4 MB/s


# dd if=/dev/zero of=./some-file.bin bs=4K count=4096 oflag=direct
4096+0 records in
4096+0 records out
16777216 bytes (17 MB) copied, 3.94093 s, 4.3 MB/s
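
For small writes the MB/s figure hides what matters, so here is the same data from the two dd runs above turned into writes per second and mean time per write. dd_stats is just an illustrative helper name.

```shell
#!/bin/sh
# Derive writes/sec and mean latency per write from a dd run.
# dd_stats is a hypothetical helper, not part of dd.
# Usage: dd_stats <record_count> <elapsed_seconds>
dd_stats() {
    LC_ALL=C awk -v n="$1" -v s="$2" 'BEGIN {
        printf "%.0f writes/s, %.3f ms per write\n", n / s, s / n * 1000
    }'
}

dd_stats 4096 2.27839   # first run above
dd_stats 4096 3.94093   # second run above
```

That comes to roughly 1798 writes/s (0.556 ms each) for the first run and 1039 writes/s (0.962 ms each) for the second.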

> Note: At the end of this reply is a listing of all our volume settings (gluster volume get volname all).

Here is an output of all gluster volume settings as they currently stand:

# gluster volume get uat_storage all
Option                                  Value
------                                  -----
cluster.lookup-unhashed                 on
cluster.lookup-optimize                 true
cluster.min-free-disk                   10%
cluster.min-free-inodes                 5%
cluster.rebalance-stats                 off
cluster.subvols-per-directory           (null)
cluster.readdir-optimize                true
cluster.rsync-hash-regex                (null)
cluster.extra-hash-regex                (null)
cluster.dht-xattr-name                  trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid    off
cluster.rebal-throttle                  normal
cluster.lock-migration                  off
cluster.local-volume-name               (null)
cluster.weighted-rebalance              on
cluster.switch-pattern                  (null)
cluster.entry-change-log                on
cluster.read-subvolume                  (null)
cluster.read-subvolume-index            -1
cluster.read-hash-mode                  1
cluster.background-self-heal-count      8
cluster.metadata-self-heal              on
cluster.data-self-heal                  on
cluster.entry-self-heal                 on
cluster.self-heal-daemon                on
cluster.heal-timeout                    600
cluster.self-heal-window-size           1
cluster.data-change-log                 on
cluster.metadata-change-log             on
cluster.data-self-heal-algorithm        (null)
cluster.eager-lock                      true
disperse.eager-lock                     on
disperse.other-eager-lock               on
cluster.quorum-type                     none
cluster.quorum-count                    (null)
cluster.choose-local                    true
cluster.self-heal-readdir-size          1KB
cluster.post-op-delay-secs              1
cluster.ensure-durability               on
cluster.consistent-metadata             no
cluster.heal-wait-queue-length          128
cluster.favorite-child-policy           size
cluster.full-lock                       yes
cluster.stripe-block-size               128KB
cluster.stripe-coalesce                 true
diagnostics.latency-measurement         off
diagnostics.dump-fd-stats               off
diagnostics.count-fop-hits              off
diagnostics.brick-log-level             ERROR
diagnostics.client-log-level            ERROR
diagnostics.brick-sys-log-level         CRITICAL
diagnostics.client-sys-log-level        CRITICAL
diagnostics.brick-logger                (null)
diagnostics.client-logger               (null)
diagnostics.brick-log-format            (null)
diagnostics.client-log-format           (null)
diagnostics.brick-log-buf-size          5
diagnostics.client-log-buf-size         5
diagnostics.brick-log-flush-timeout     120
diagnostics.client-log-flush-timeout    120
diagnostics.stats-dump-interval         0
diagnostics.fop-sample-interval         0
diagnostics.stats-dump-format           json
diagnostics.fop-sample-buf-size         65535
diagnostics.stats-dnscache-ttl-sec      86400
performance.cache-max-file-size         6MB
performance.cache-min-file-size         0
performance.cache-refresh-timeout       60
performance.cache-size                  1024MB
performance.io-thread-count             16
performance.high-prio-threads           16
performance.normal-prio-threads         16
performance.low-prio-threads            16
performance.least-prio-threads          1
performance.enable-least-priority       on
performance.cache-size                  1024MB
performance.flush-behind                on
performance.nfs.flush-behind            on
performance.write-behind-window-size    256MB
performance.strict-o-direct             off
performance.nfs.strict-o-direct         off
performance.strict-write-ordering       off
performance.nfs.strict-write-ordering   off
performance.lazy-open                   yes
performance.read-after-open             no
performance.read-ahead-page-count       4
performance.md-cache-timeout            600
performance.cache-swift-metadata        true
performance.cache-samba-metadata        false
performance.cache-capability-xattrs     true
performance.cache-ima-xattrs            true
features.encryption                     off
encryption.master-key                   (null)
encryption.data-key-size                256
encryption.block-size                   4096
network.frame-timeout                   1800
network.ping-timeout                    15
network.tcp-window-size                 (null)
features.lock-heal                      off
features.grace-timeout                  10
network.remote-dio                      disable
client.event-threads                    8
client.tcp-user-timeout                 0
client.keepalive-time                   20
client.keepalive-interval               2
client.keepalive-count                  9
network.tcp-window-size                 (null)
network.inode-lru-limit                 50000
auth.allow                              *
auth.reject                             (null)
transport.keepalive                     1
server.allow-insecure                   (null)
server.root-squash                      off
server.anonuid                          65534
server.anongid                          65534
server.statedump-path                   /var/run/gluster
server.outstanding-rpc-limit            256
features.lock-heal                      off
features.grace-timeout                  10
server.ssl                              (null)
auth.ssl-allow                          *
server.manage-gids                      off
server.dynamic-auth                     on
client.send-gids                        on
server.gid-timeout                      300
server.own-thread                       (null)
server.event-threads                    8
server.tcp-user-timeout                 0
server.keepalive-time                   20
server.keepalive-interval               2
server.keepalive-count                  9
transport.listen-backlog                2048
ssl.own-cert                            (null)
ssl.private-key                         (null)
ssl.ca-list                             (null)
ssl.crl-path                            (null)
ssl.certificate-depth                   (null)
ssl.cipher-list                         (null)
ssl.dh-param                            (null)
ssl.ec-curve                            (null)
transport.address-family                inet
performance.write-behind                on
performance.read-ahead                  on
performance.readdir-ahead               on
performance.io-cache                    on
performance.quick-read                  on
performance.open-behind                 on
performance.nl-cache                    off
performance.stat-prefetch               true
performance.client-io-threads           true
performance.nfs.write-behind            on
performance.nfs.read-ahead              off
performance.nfs.io-cache                off
performance.nfs.quick-read              off
performance.nfs.stat-prefetch           off
performance.nfs.io-threads              off
performance.force-readdirp              true
performance.cache-invalidation          true
features.uss                            off
features.snapshot-directory             .snaps
features.show-snapshot-directory        off
network.compression                     off
network.compression.window-size         -15
network.compression.mem-level           8
network.compression.min-size            0
network.compression.compression-level   -1
network.compression.debug               false
features.limit-usage                    (null)
features.default-soft-limit             80%
features.soft-timeout                   60
features.hard-timeout                   5
features.alert-time                     86400
features.quota-deem-statfs              off
geo-replication.indexing                off
geo-replication.indexing                off
geo-replication.ignore-pid-check        off
geo-replication.ignore-pid-check        off
features.quota                          off
features.inode-quota                    off
features.bitrot                         disable
debug.trace                             off
debug.log-history                       no
debug.log-file                          no
debug.exclude-ops                       (null)
debug.include-ops                       (null)
debug.error-gen                         off
debug.error-failure                     (null)
debug.error-number                      (null)
debug.random-failure                    off
debug.error-fops                        (null)
nfs.disable                             on
features.read-only                      off
features.worm                           off
features.worm-file-level                off
features.worm-files-deletable           on
features.default-retention-period       120
features.retention-mode                 relax
features.auto-commit-period             180
storage.linux-aio                       off
storage.batch-fsync-mode                reverse-fsync
storage.batch-fsync-delay-usec          0
storage.owner-uid                       -1
storage.owner-gid                       -1
storage.node-uuid-pathinfo              off
storage.health-check-interval           30
storage.build-pgfid                     off
storage.gfid2path                       on
storage.gfid2path-separator             :
storage.reserve                         1
storage.bd-aio                          off
config.gfproxyd                         off
cluster.server-quorum-type              off
cluster.server-quorum-ratio             0
changelog.changelog                     off
changelog.changelog-dir                 (null)
changelog.encoding                      ascii
changelog.rollover-time                 15
changelog.fsync-interval                5
changelog.changelog-barrier-timeout     120
changelog.capture-del-path              off
features.barrier                        disable
features.barrier-timeout                120
features.trash                          off
features.trash-dir                      .trashcan
features.trash-eliminate-path           (null)
features.trash-max-filesize             5MB
features.trash-internal-op              off
cluster.enable-shared-storage           disable
cluster.write-freq-threshold            0
cluster.read-freq-threshold             0
cluster.tier-pause                      off
cluster.tier-promote-frequency          120
cluster.tier-demote-frequency           3600
cluster.watermark-hi                    90
cluster.watermark-low                   75
cluster.tier-mode                       cache
cluster.tier-max-promote-file-size      0
cluster.tier-max-mb                     4000
cluster.tier-max-files                  10000
cluster.tier-query-limit                100
cluster.tier-compact                    on
cluster.tier-hot-compact-frequency      604800
cluster.tier-cold-compact-frequency     604800
features.ctr-enabled                    off
features.record-counters                off
features.ctr-record-metadata-heat       off
features.ctr_link_consistency           off
features.ctr_lookupheal_link_timeout    300
features.ctr_lookupheal_inode_timeout   300
features.ctr-sql-db-cachesize           12500
features.ctr-sql-db-wal-autocheckpoint  25000
features.selinux                        on
locks.trace                             off
locks.mandatory-locking                 off
cluster.disperse-self-heal-daemon       enable
cluster.quorum-reads                    no
client.bind-insecure                    (null)
features.shard                          off
features.shard-block-size               64MB
features.scrub-throttle                 lazy
features.scrub-freq                     biweekly
features.scrub                          false
features.expiry-time                    120
features.cache-invalidation             true
features.cache-invalidation-timeout     600
features.leases                         off
features.lease-lock-recall-timeout      60
disperse.background-heals               8
disperse.heal-wait-qlength              128
cluster.heal-timeout                    600
dht.force-readdirp                      on
disperse.read-policy                    round-robin
cluster.shd-max-threads                 1
cluster.shd-wait-qlength                1024
cluster.locking-scheme                  full
cluster.granular-entry-heal             no
features.locks-revocation-secs          0
features.locks-revocation-clear-all     false
features.locks-revocation-max-blocked   0
features.locks-monkey-unlocking         false
disperse.shd-max-threads                1
disperse.shd-wait-qlength               1024
disperse.cpu-extensions                 auto
disperse.self-heal-window-size          1
cluster.use-compound-fops               true
performance.parallel-readdir            off
performance.rda-request-size            131072
performance.rda-low-wmark               4096
performance.rda-high-wmark              128KB
performance.rda-cache-limit             256MB
performance.nl-cache-positive-entry     false
performance.nl-cache-limit              10MB
performance.nl-cache-timeout            60
cluster.brick-multiplex                 off
cluster.max-bricks-per-process          0
disperse.optimistic-change-log          on
cluster.halo-enabled                    False
cluster.halo-shd-max-latency            99999
cluster.halo-nfsd-max-latency           5
cluster.halo-max-latency                5
cluster.halo-max-replicas               99999
cluster.halo-min-replicas               2
debug.delay-gen                         off
delay-gen.delay-percentage              10%
delay-gen.delay-duration                100000
delay-gen.enable                        (null)
disperse.parallel-writes                on
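
The full listing is long, so a small filter can help when comparing just the options discussed in this thread. A minimal sketch, assuming the two-column "Option Value" layout shown above; show_opts is a made-up helper name.

```shell
#!/bin/sh
# Print only the named options from `gluster volume get <vol> all` output
# read on stdin, as option=value, skipping repeated entries.
# show_opts is a hypothetical helper name.
# Usage: gluster volume get uat_storage all | show_opts opt1 opt2 ...
show_opts() {
    awk -v want="$*" '
        BEGIN { n = split(want, w, " "); for (i = 1; i <= n; i++) keep[w[i]] = 1 }
        ($1 in keep) && !seen[$1]++ { print $1 "=" $2 }
    '
}

# Example:
#   gluster volume get uat_storage all | \
#       show_opts performance.cache-size performance.write-behind-window-size
```

Options that appear more than once in the listing (performance.cache-size, for example) are only printed the first time they are seen.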

Sam McLeod (protoporpoise on IRC)

Words are my own opinions and do not necessarily represent those of my employer or partners.
