[Gluster-users] GFS performance under heavy traffic
David Cunningham
dcunningham at voisonics.com
Fri Dec 27 01:23:51 UTC 2019
Oh and I see that the op-version is slightly less than the max-op-version:
[root at gfs1 ~]# gluster volume get all cluster.max-op-version
Option Value
------ -----
cluster.max-op-version 50400
[root at gfs1 ~]# gluster volume get all cluster.op-version
Option Value
------ -----
cluster.op-version 50000
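If we stay on the current version, I believe we could bump the op-version to
match with something like the following (assuming all clients support it):
# gluster volume set all cluster.op-version 50400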
On Fri, 27 Dec 2019 at 14:22, David Cunningham <dcunningham at voisonics.com>
wrote:
> Hi Strahil,
>
> Our volume options are as below. Thanks for the suggestion to upgrade to
> version 6 or 7. We could do that by simply removing the current
> installation and installing the new one (since it's not live right now). We
> might have to convince the customer that it's likely to succeed though, as
> at the moment I think they believe that GFS is not going to work for them.
>
> Option Value
>
> ------ -----
>
> cluster.lookup-unhashed on
>
> cluster.lookup-optimize on
>
> cluster.min-free-disk 10%
>
> cluster.min-free-inodes 5%
>
> cluster.rebalance-stats off
>
> cluster.subvols-per-directory (null)
>
> cluster.readdir-optimize off
>
> cluster.rsync-hash-regex (null)
>
> cluster.extra-hash-regex (null)
>
> cluster.dht-xattr-name trusted.glusterfs.dht
>
> cluster.randomize-hash-range-by-gfid off
>
> cluster.rebal-throttle normal
>
> cluster.lock-migration off
>
> cluster.force-migration off
>
> cluster.local-volume-name (null)
>
> cluster.weighted-rebalance on
>
> cluster.switch-pattern (null)
>
> cluster.entry-change-log on
>
> cluster.read-subvolume (null)
>
> cluster.read-subvolume-index -1
>
> cluster.read-hash-mode 1
>
> cluster.background-self-heal-count 8
>
> cluster.metadata-self-heal on
>
> cluster.data-self-heal on
>
> cluster.entry-self-heal on
>
> cluster.self-heal-daemon on
>
> cluster.heal-timeout 600
>
> cluster.self-heal-window-size 1
>
> cluster.data-change-log on
>
> cluster.metadata-change-log on
>
> cluster.data-self-heal-algorithm (null)
>
> cluster.eager-lock on
>
> disperse.eager-lock on
>
> disperse.other-eager-lock on
>
> disperse.eager-lock-timeout 1
>
> disperse.other-eager-lock-timeout 1
>
> cluster.quorum-type none
>
> cluster.quorum-count (null)
>
> cluster.choose-local true
>
> cluster.self-heal-readdir-size 1KB
>
> cluster.post-op-delay-secs 1
>
> cluster.ensure-durability on
>
> cluster.consistent-metadata no
>
> cluster.heal-wait-queue-length 128
>
> cluster.favorite-child-policy none
>
> cluster.full-lock yes
>
> cluster.stripe-block-size 128KB
>
> cluster.stripe-coalesce true
>
> diagnostics.latency-measurement off
>
> diagnostics.dump-fd-stats off
>
> diagnostics.count-fop-hits off
>
> diagnostics.brick-log-level INFO
>
> diagnostics.client-log-level INFO
>
> diagnostics.brick-sys-log-level CRITICAL
>
> diagnostics.client-sys-log-level CRITICAL
>
> diagnostics.brick-logger (null)
>
> diagnostics.client-logger (null)
>
> diagnostics.brick-log-format (null)
>
> diagnostics.client-log-format (null)
>
> diagnostics.brick-log-buf-size 5
>
> diagnostics.client-log-buf-size 5
>
> diagnostics.brick-log-flush-timeout 120
>
> diagnostics.client-log-flush-timeout 120
>
> diagnostics.stats-dump-interval 0
>
> diagnostics.fop-sample-interval 0
>
> diagnostics.stats-dump-format json
>
> diagnostics.fop-sample-buf-size 65535
>
> diagnostics.stats-dnscache-ttl-sec 86400
>
> performance.cache-max-file-size 0
>
> performance.cache-min-file-size 0
>
> performance.cache-refresh-timeout 1
>
> performance.cache-priority
>
> performance.cache-size 32MB
>
> performance.io-thread-count 16
>
> performance.high-prio-threads 16
>
> performance.normal-prio-threads 16
>
> performance.low-prio-threads 16
>
> performance.least-prio-threads 1
>
> performance.enable-least-priority on
>
> performance.iot-watchdog-secs (null)
>
> performance.iot-cleanup-disconnected-reqs off
>
> performance.iot-pass-through false
>
> performance.io-cache-pass-through false
>
> performance.cache-size 128MB
>
> performance.qr-cache-timeout 1
>
> performance.cache-invalidation false
>
> performance.ctime-invalidation false
>
> performance.flush-behind on
>
> performance.nfs.flush-behind on
>
> performance.write-behind-window-size 1MB
>
> performance.resync-failed-syncs-after-fsync off
>
> performance.nfs.write-behind-window-size 1MB
>
> performance.strict-o-direct off
>
> performance.nfs.strict-o-direct off
>
> performance.strict-write-ordering off
>
> performance.nfs.strict-write-ordering off
>
> performance.write-behind-trickling-writes on
>
> performance.aggregate-size 128KB
>
> performance.nfs.write-behind-trickling-writes on
>
> performance.lazy-open yes
>
> performance.read-after-open yes
>
> performance.open-behind-pass-through false
>
> performance.read-ahead-page-count 4
>
> performance.read-ahead-pass-through false
>
> performance.readdir-ahead-pass-through false
>
> performance.md-cache-pass-through false
>
> performance.md-cache-timeout 1
>
> performance.cache-swift-metadata true
>
> performance.cache-samba-metadata false
>
> performance.cache-capability-xattrs true
>
> performance.cache-ima-xattrs true
>
> performance.md-cache-statfs off
>
> performance.xattr-cache-list
>
> performance.nl-cache-pass-through false
>
> features.encryption off
>
> encryption.master-key (null)
>
> encryption.data-key-size 256
>
> encryption.block-size 4096
>
> network.frame-timeout 1800
>
> network.ping-timeout 42
>
> network.tcp-window-size (null)
>
> network.remote-dio disable
>
> client.event-threads 2
>
> client.tcp-user-timeout 0
>
> client.keepalive-time 20
>
> client.keepalive-interval 2
>
> client.keepalive-count 9
>
> network.tcp-window-size (null)
>
> network.inode-lru-limit 16384
>
> auth.allow *
>
> auth.reject (null)
>
> transport.keepalive 1
>
> server.allow-insecure on
>
> server.root-squash off
>
> server.anonuid 65534
>
> server.anongid 65534
>
> server.statedump-path /var/run/gluster
>
> server.outstanding-rpc-limit 64
>
> server.ssl (null)
>
> auth.ssl-allow *
>
> server.manage-gids off
>
> server.dynamic-auth on
>
> client.send-gids on
>
> server.gid-timeout 300
>
> server.own-thread (null)
>
> server.event-threads 1
>
> server.tcp-user-timeout 0
>
> server.keepalive-time 20
>
> server.keepalive-interval 2
>
> server.keepalive-count 9
>
> transport.listen-backlog 1024
>
> ssl.own-cert (null)
>
> ssl.private-key (null)
>
> ssl.ca-list (null)
>
> ssl.crl-path (null)
>
> ssl.certificate-depth (null)
>
> ssl.cipher-list (null)
>
> ssl.dh-param (null)
>
> ssl.ec-curve (null)
>
> transport.address-family inet
>
> performance.write-behind on
>
> performance.read-ahead on
>
> performance.readdir-ahead on
>
> performance.io-cache on
>
> performance.quick-read on
>
> performance.open-behind on
>
> performance.nl-cache off
>
> performance.stat-prefetch on
>
> performance.client-io-threads off
>
> performance.nfs.write-behind on
>
> performance.nfs.read-ahead off
>
> performance.nfs.io-cache off
>
> performance.nfs.quick-read off
>
> performance.nfs.stat-prefetch off
>
> performance.nfs.io-threads off
>
> performance.force-readdirp true
>
> performance.cache-invalidation false
>
> features.uss off
>
> features.snapshot-directory .snaps
>
> features.show-snapshot-directory off
>
> features.tag-namespaces off
>
> network.compression off
>
> network.compression.window-size -15
>
> network.compression.mem-level 8
>
> network.compression.min-size 0
>
> network.compression.compression-level -1
>
> network.compression.debug false
>
> features.default-soft-limit 80%
>
> features.soft-timeout 60
>
> features.hard-timeout 5
>
> features.alert-time 86400
>
> features.quota-deem-statfs off
>
> geo-replication.indexing off
>
> geo-replication.indexing off
>
> geo-replication.ignore-pid-check off
>
> geo-replication.ignore-pid-check off
>
> features.quota off
>
> features.inode-quota off
>
> features.bitrot disable
>
> debug.trace off
>
> debug.log-history no
>
> debug.log-file no
>
> debug.exclude-ops (null)
>
> debug.include-ops (null)
>
> debug.error-gen off
>
> debug.error-failure (null)
>
> debug.error-number (null)
>
> debug.random-failure off
>
> debug.error-fops (null)
>
> nfs.disable on
>
> features.read-only off
>
> features.worm off
>
> features.worm-file-level off
>
> features.worm-files-deletable on
>
> features.default-retention-period 120
>
> features.retention-mode relax
>
> features.auto-commit-period 180
>
> storage.linux-aio off
>
> storage.batch-fsync-mode reverse-fsync
>
> storage.batch-fsync-delay-usec 0
>
> storage.owner-uid -1
>
> storage.owner-gid -1
>
> storage.node-uuid-pathinfo off
>
> storage.health-check-interval 30
>
> storage.build-pgfid off
>
> storage.gfid2path on
>
> storage.gfid2path-separator :
>
> storage.reserve 1
>
> storage.health-check-timeout 10
>
> storage.fips-mode-rchecksum off
>
> storage.force-create-mode 0000
>
> storage.force-directory-mode 0000
>
> storage.create-mask 0777
>
> storage.create-directory-mask 0777
>
> storage.max-hardlinks 100
>
> storage.ctime off
>
> storage.bd-aio off
>
> config.gfproxyd off
>
> cluster.server-quorum-type off
>
> cluster.server-quorum-ratio 0
>
> changelog.changelog off
>
> changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
> changelog.encoding ascii
>
> changelog.rollover-time 15
>
> changelog.fsync-interval 5
>
> changelog.changelog-barrier-timeout 120
>
> changelog.capture-del-path off
>
> features.barrier disable
>
> features.barrier-timeout 120
>
> features.trash off
>
> features.trash-dir .trashcan
>
> features.trash-eliminate-path (null)
>
> features.trash-max-filesize 5MB
>
> features.trash-internal-op off
>
> cluster.enable-shared-storage disable
>
> cluster.write-freq-threshold 0
>
> cluster.read-freq-threshold 0
>
> cluster.tier-pause off
>
> cluster.tier-promote-frequency 120
>
> cluster.tier-demote-frequency 3600
>
> cluster.watermark-hi 90
>
> cluster.watermark-low 75
>
> cluster.tier-mode cache
>
> cluster.tier-max-promote-file-size 0
>
> cluster.tier-max-mb 4000
>
> cluster.tier-max-files 10000
>
> cluster.tier-query-limit 100
>
> cluster.tier-compact on
>
> cluster.tier-hot-compact-frequency 604800
>
> cluster.tier-cold-compact-frequency 604800
>
> features.ctr-enabled off
>
> features.record-counters off
>
> features.ctr-record-metadata-heat off
>
> features.ctr_link_consistency off
>
> features.ctr_lookupheal_link_timeout 300
>
> features.ctr_lookupheal_inode_timeout 300
>
> features.ctr-sql-db-cachesize 12500
>
> features.ctr-sql-db-wal-autocheckpoint 25000
>
> features.selinux on
>
> locks.trace off
>
> locks.mandatory-locking off
>
> cluster.disperse-self-heal-daemon enable
>
> cluster.quorum-reads no
>
> client.bind-insecure (null)
>
> features.shard off
>
> features.shard-block-size 64MB
>
> features.shard-lru-limit 16384
>
> features.shard-deletion-rate 100
>
> features.scrub-throttle lazy
>
> features.scrub-freq biweekly
>
> features.scrub false
>
> features.expiry-time 120
>
> features.cache-invalidation off
>
> features.cache-invalidation-timeout 60
>
> features.leases off
>
> features.lease-lock-recall-timeout 60
>
> disperse.background-heals 8
>
> disperse.heal-wait-qlength 128
>
> cluster.heal-timeout 600
>
> dht.force-readdirp on
>
> disperse.read-policy gfid-hash
>
> cluster.shd-max-threads 1
>
> cluster.shd-wait-qlength 1024
>
> cluster.locking-scheme full
>
> cluster.granular-entry-heal no
>
> features.locks-revocation-secs 0
>
> features.locks-revocation-clear-all false
>
> features.locks-revocation-max-blocked 0
>
> features.locks-monkey-unlocking false
>
> features.locks-notify-contention no
>
> features.locks-notify-contention-delay 5
>
> disperse.shd-max-threads 1
>
> disperse.shd-wait-qlength 1024
>
> disperse.cpu-extensions auto
>
> disperse.self-heal-window-size 1
>
> cluster.use-compound-fops off
>
> performance.parallel-readdir off
>
> performance.rda-request-size 131072
>
> performance.rda-low-wmark 4096
>
> performance.rda-high-wmark 128KB
>
> performance.rda-cache-limit 10MB
>
> performance.nl-cache-positive-entry false
>
> performance.nl-cache-limit 10MB
>
> performance.nl-cache-timeout 60
>
> cluster.brick-multiplex off
>
> cluster.max-bricks-per-process 0
>
> disperse.optimistic-change-log on
>
> disperse.stripe-cache 4
>
> cluster.halo-enabled False
>
> cluster.halo-shd-max-latency 99999
>
> cluster.halo-nfsd-max-latency 5
>
> cluster.halo-max-latency 5
>
> cluster.halo-max-replicas 99999
>
> cluster.halo-min-replicas 2
>
> cluster.daemon-log-level INFO
>
> debug.delay-gen off
>
> delay-gen.delay-percentage 10%
>
> delay-gen.delay-duration 100000
>
> delay-gen.enable
>
> disperse.parallel-writes on
>
> features.sdfs on
>
> features.cloudsync off
>
> features.utime off
>
> ctime.noatime on
>
> feature.cloudsync-storetype (null)
>
>
> Thanks again.
>
>
> On Wed, 25 Dec 2019 at 05:51, Strahil <hunter86_bg at yahoo.com> wrote:
>
>> Hi David,
>>
>> On Dec 24, 2019 02:47, David Cunningham <dcunningham at voisonics.com>
>> wrote:
>> >
>> > Hello,
>> >
>> > In testing we found that actually the GFS client having access to all 3
>> nodes made no difference to performance. Perhaps that's because the 3rd
>> node that wasn't accessible from the client before was the arbiter node?
>> It makes sense, as no data is being generated towards the arbiter.
>> > Presumably we shouldn't have an arbiter node listed under
>> backupvolfile-server when mounting the filesystem? Since it doesn't store
>> all the data surely it can't be used to serve the data.
>>
>> I have my arbiter defined as last backup and no issues so far. At least
>> the admin can easily identify the bricks from the mount options.
>>
>> > We did have direct-io-mode=disable already as well, so that wasn't a
>> factor in the performance problems.
>>
>> Have you checked that the client version is not too old?
>> You can also check the cluster's operation version:
>> # gluster volume get all cluster.max-op-version
>> # gluster volume get all cluster.op-version
>>
>> The cluster's op-version should be at the max-op-version.
>>
>> Two options come to mind:
>> A) Upgrade to the latest Gluster v6 or even v7 (I know it won't be easy) and
>> then set the op-version to the highest possible.
>> # gluster volume get all cluster.max-op-version
>> # gluster volume get all cluster.op-version
>>
>> B) Deploy an NFS-Ganesha server and connect the client over NFS v4.2 (and
>> control the parallel connections from Ganesha).
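>>
>> For example (just a sketch; the hostname and mount point are placeholders,
>> and the actual export path depends on the Ganesha export configuration):
>> # mount -t nfs -o vers=4.2 ganesha-host:/gvol0 /mnt/glusterfs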
>>
>> Can you provide your Gluster volume's options?
>> 'gluster volume get <VOLNAME> all'
>>
>> > Thanks again for any advice.
>> >
>> >
>> >
>> > On Mon, 23 Dec 2019 at 13:09, David Cunningham <
>> dcunningham at voisonics.com> wrote:
>> >>
>> >> Hi Strahil,
>> >>
>> >> Thanks for that. We do have one backup server specified, but will add
>> the second backup as well.
>> >>
>> >>
>> >> On Sat, 21 Dec 2019 at 11:26, Strahil <hunter86_bg at yahoo.com> wrote:
>> >>>
>> >>> Hi David,
>> >>>
>> >>> Also consider using the mount option to specify backup servers via
>> >>> 'backupvolfile-server=server2:server3' (you can define more, but I don't
>> >>> think replica volumes greater than 3 are useful, except maybe in some
>> >>> special cases).
>> >>>
>> >>> That way, when the primary is lost, your client can reach a backup
>> >>> server without disruption.
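>> >>>
>> >>> As a rough example (hostnames and volume name are placeholders), the
>> >>> fstab entry could look like:
>> >>> gfs1:/gvol0 /mnt/glusterfs glusterfs defaults,_netdev,backupvolfile-server=gfs2:gfs3 0 0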
>> >>>
>> >>> P.S.: The client may 'hang' if the primary server was rebooted
>> >>> ungracefully, as the communication must time out before FUSE addresses the
>> >>> next server. There is a special script for killing gluster processes in
>> >>> '/usr/share/gluster/scripts' which can be used to set up a systemd service
>> >>> that does that for you on shutdown.
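>> >>>
>> >>> As a rough sketch (the exact script name under that directory is a
>> >>> placeholder here, and ordering dependencies may also be needed), such a
>> >>> unit could look like:
>> >>>
>> >>> [Unit]
>> >>> Description=Stop Gluster processes cleanly on shutdown
>> >>>
>> >>> [Service]
>> >>> Type=oneshot
>> >>> RemainAfterExit=yes
>> >>> ExecStart=/bin/true
>> >>> ExecStop=/usr/share/gluster/scripts/<kill-gluster-processes>.sh
>> >>>
>> >>> [Install]
>> >>> WantedBy=multi-user.target
>> >>>
>> >>> When the service is stopped at shutdown, ExecStop runs the kill script.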
>> >>>
>> >>> Best Regards,
>> >>> Strahil Nikolov
>> >>>
>> >>> On Dec 20, 2019 23:49, David Cunningham <dcunningham at voisonics.com>
>> wrote:
>> >>>>
>> >>>> Hi Strahil,
>> >>>>
>> >>>> Ah, that is an important point. One of the nodes is not accessible
>> from the client, and we assumed that it only needed to reach the GFS node
>> that was mounted so didn't think anything of it.
>> >>>>
>> >>>> We will try making all nodes accessible, as well as
>> "direct-io-mode=disable".
>> >>>>
>> >>>> Thank you.
>> >>>>
>> >>>>
>> >>>> On Sat, 21 Dec 2019 at 10:29, Strahil Nikolov <hunter86_bg at yahoo.com>
>> wrote:
>> >>>>>
>> >>>>> Actually, I haven't made myself clear.
>> >>>>> A FUSE mount on the client side connects directly to all bricks that
>> >>>>> make up the volume.
>> >>>>> If for some reason (bad routing, a firewall block) the client can reach
>> >>>>> only 2 out of 3 bricks, this can constantly cause healing to happen (as
>> >>>>> one of the bricks is never updated), which will degrade performance and
>> >>>>> cause excessive network usage.
>> >>>>> As your attachment is from one of the gluster nodes, this could be
>> >>>>> the case.
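>> >>>>>
>> >>>>> One rough way to check: run 'gluster volume status <VOLNAME> clients' on
>> >>>>> one of the nodes and confirm that every brick lists the client, or look
>> >>>>> in the client's mount log under /var/log/glusterfs/ for a brick that
>> >>>>> never connects.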
>> >>>>>
>> >>>>> Best Regards,
>> >>>>> Strahil Nikolov
>> >>>>>
>> >>>>> В петък, 20 декември 2019 г., 01:49:56 ч. Гринуич+2, David
>> Cunningham <dcunningham at voisonics.com> написа:
>> >>>>>
>> >>>>>
>> >>>>> Hi Strahil,
>> >>>>>
>> >>>>> The chart attached to my original email is taken from the GFS
>> server.
>> >>>>>
>> >>>>> I'm not sure what you mean by accessing all bricks simultaneously.
>> We've mounted it from the client like this:
>> >>>>> gfs1:/gvol0 /mnt/glusterfs/ glusterfs
>> defaults,direct-io-mode=disable,_netdev,backupvolfile-server=gfs2,fetch-attempts=10
>> 0 0
>> >>>>>
>> >>>>> Should we do something different to access all bricks
>> simultaneously?
>> >>>>>
>> >>>>> Thanks for your help!
>> >>>>>
>> >>>>>
>> >>>>> On Fri, 20 Dec 2019 at 11:47, Strahil Nikolov <
>> hunter86_bg at yahoo.com> wrote:
>> >>>>>>
>> >>>>>> I'm not sure whether you measured the traffic from the client side
>> >>>>>> (tcpdump on a client machine) or from the server side.
>> >>>>>>
>> >>>>>> In both cases, please verify that the client accesses all bricks
>> >>>>>> simultaneously, as a missing connection can cause unnecessary heals.
>> >>>>>>
>> >>>>>> Have you thought about upgrading to v6? There are some
>> enhancements in v6 which could be beneficial.
>> >>>>>>
>> >>>>>> Yet, it is indeed strange that so much traffic is generated with
>> FUSE.
>> >>>>>>
>> >>>>>> Another approach is to test with NFS-Ganesha, which supports pNFS and
>> >>>>>> can speak natively with Gluster; that can bring you closer to the
>> >>>>>> previous setup and also provide some extra performance.
>> >>>>>>
>> >>>>>>
>> >>>>>> Best Regards,
>> >>>>>> Strahil Nikolov
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>
>> >>
>> >> --
>> >> David Cunningham, Voisonics Limited
>> >> http://voisonics.com/
>> >> USA: +1 213 221 1092
>> >> New Zealand: +64 (0)28 2558 3782
>> >
>> >
>> >
>> > --
>> > David Cunningham, Voisonics Limited
>> > http://voisonics.com/
>> > USA: +1 213 221 1092
>> > New Zealand: +64 (0)28 2558 3782
>>
>> Best Regards,
>> Strahil Nikolov
>>
>
>
> --
> David Cunningham, Voisonics Limited
> http://voisonics.com/
> USA: +1 213 221 1092
> New Zealand: +64 (0)28 2558 3782
>
--
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782