[Gluster-users] GFS performance under heavy traffic
Strahil
hunter86_bg at yahoo.com
Sat Dec 28 04:46:49 UTC 2019
Hi David,
It seems that I have misread your quorum options, so just ignore that from my previous e-mail.
Best Regards,
Strahil Nikolov

On Dec 27, 2019 15:38, Strahil <hunter86_bg at yahoo.com> wrote:
>
> Hi David,
>
> Gluster supports live rolling upgrades, so there is no need to redeploy at all. The migration notes should be checked first, though, as some features must be disabled before upgrading.
> Also, the gluster client should remount in order to bump the gluster op-version.
>
> What kind of workload do you have?
> I'm asking because there are predefined (and recommended) settings located in /var/lib/glusterd/groups.
> You can check the options in each group and cross-check each option's meaning in the docs before activating a setting.
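One way to review a group before applying it is to list what it would set. The following is a minimal sketch, assuming group files hold one option=value pair per line; show_group is a hypothetical helper name, GROUP_DIR stands for the groups directory on your servers, and 'virt' is only an example group that may or may not exist on your install.

```shell
# show_group FILE: print each option in a group file next to the value it would set.
# Assumes one option=value pair per line; blank lines and '#' comments are skipped.
show_group() {
    awk '
        /^[ \t]*(#|$)/ { next }                           # skip comments and empty lines
        {
            name = $0;  sub(/=.*/, "", name)              # text before the first "="
            value = $0; sub(/^[^=]*=/, "", value)         # text after the first "="
            print name " = " value
        }
    ' "$1"
}

# Usage (GROUP_DIR is the groups directory on your gluster servers):
#   show_group "$GROUP_DIR/virt"
#   gluster volume set <VOLNAME> group virt    # apply only after reviewing the list
```

Nothing is changed on the volume until the 'gluster volume set ... group' command is run, so the listing is safe to use on a production node.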
>
> I still have a vague feeling that a heal was going on during that peak in network bandwidth. Have you checked that?
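A quick way to check for a heal is to total the pending entries that 'gluster volume heal <VOLNAME> info' reports. This is a sketch under the assumption that the output contains one "Number of entries: N" line per brick; pending_heals is a hypothetical helper name, and gvol0 is the volume name from the fstab line quoted further down in this thread.

```shell
# pending_heals: read `gluster volume heal <VOLNAME> info` output on stdin and
# print the sum of all "Number of entries:" lines (assumed: one line per brick).
pending_heals() {
    awk -F': *' '/^Number of entries:/ { total += $2 } END { print total + 0 }'
}

# Usage:
#   gluster volume heal gvol0 info | pending_heals
```

A total that stays above zero while the traffic peak is happening would support the heal theory; a non-numeric count (for example for an offline brick) is treated as zero here.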
>
> Also, sharding is very useful when you work with large files, as a heal is then reduced to the size of a shard.
>
> N.B.: Once sharding is enabled, DO NOT DISABLE it, as you will lose your data.
>
> Using Gluster v7.1 (soon available on CentOS & Debian) gives you the latest features and optimizations, and the Gluster developer community is quite active in supporting it.
>
> P.S: I'm wondering how 'performance.cache-size' can both be 32 MB and 128 MB. Please double-check this (maybe I'm reading it wrong on my smartphone) and if needed raise a bug on bugzilla.redhat.com
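To double-check this on the server rather than by eye, the option list can be scanned for names that appear more than once with different values. A minimal sketch, assuming the two-column "Option Value" layout shown in this thread; conflicting_options is a hypothetical helper name.

```shell
# conflicting_options: read `gluster volume get <VOLNAME> all` output on stdin and
# print every option name that is listed more than once with different values.
conflicting_options() {
    awk '
        NF < 2 || $1 == "Option" || $1 ~ /^-+$/ { next }   # skip header, rule, valueless rows
        ($1 in seen) && seen[$1] != $2 { print $1 ": " seen[$1] " vs " $2 }
        { seen[$1] = $2 }                                  # remember the latest value seen
    '
}

# Usage:
#   gluster volume get gvol0 all | conflicting_options
```

Options that are repeated with the same value are not reported, so only genuine mismatches show up.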
>
> P.S2: Please provide 'gluster volume info', as 'cluster.quorum-type' -> 'none' is not normal for replicated volumes (arbiters are used in replica volumes).
>
> According to the docs (https://docs.gluster.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/):
>
> Note: Enabling the arbiter feature automatically configures client-quorum to 'auto'. This setting is not to be changed.
>
> Here is my output (Hyperconverged Virtualization Cluster -> oVirt):
> # gluster volume info engine | grep quorum
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
>
> Changing quorum is riskier than changing other options, so you need to take the necessary precautions. I think we all know what will happen if the cluster is out of quorum and you change the quorum settings to more stringent ones :D
>
> P.S3: If you decide to reset your gluster volume to the defaults, you can create a new volume (same type as the current one), then get the options for that volume, put them in a file, and bulk deploy them via 'gluster volume set <Original Volume> group custom-group', where the file is located on every gluster server in the '/var/lib/glusterd/groups' directory.
> Last, get rid of the sample volume.
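The conversion step in P.S3 can be sketched as a small filter that turns the option listing into option=value lines. This assumes the two-column layout shown in this thread and that group files use option=value lines; to_group_file, sample-vol and custom-group are hypothetical names, and I have not verified that every listed option is accepted when applied through a group.

```shell
# to_group_file: read `gluster volume get <VOLNAME> all` output on stdin and
# print option=value lines. Rows without a value, or with the placeholder
# "(null)", are left out; so are the header and the dashed rule.
to_group_file() {
    awk '
        NF != 2 || $1 == "Option" || $1 ~ /^-+$/ || $2 == "(null)" { next }
        { print $1 "=" $2 }
    '
}

# Usage:
#   gluster volume get sample-vol all | to_group_file > custom-group
#   # copy custom-group into the groups directory on every server, then:
#   gluster volume set <Original Volume> group custom-group
```

Rows where a long option name runs straight into its value (a few appear in the listing below) are skipped by this filter and would need to be added by hand.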
>
> Best Regards,
> Strahil Nikolov
>
> On Dec 27, 2019 03:22, David Cunningham <dcunningham at voisonics.com> wrote:
>>
>> Hi Strahil,
>>
>> Our volume options are as below. Thanks for the suggestion to upgrade to version 6 or 7. We could do that by simply removing the current installation and installing the new one (since it's not live right now). We might have to convince the customer that it's likely to succeed though, as at the moment I think they believe that GFS is not going to work for them.
>>
>> Option Value
>> ------ -----
>> cluster.lookup-unhashed on
>> cluster.lookup-optimize on
>> cluster.min-free-disk 10%
>> cluster.min-free-inodes 5%
>> cluster.rebalance-stats off
>> cluster.subvols-per-directory (null)
>> cluster.readdir-optimize off
>> cluster.rsync-hash-regex (null)
>> cluster.extra-hash-regex (null)
>> cluster.dht-xattr-name trusted.glusterfs.dht
>> cluster.randomize-hash-range-by-gfid off
>> cluster.rebal-throttle normal
>> cluster.lock-migration off
>> cluster.force-migration off
>> cluster.local-volume-name (null)
>> cluster.weighted-rebalance on
>> cluster.switch-pattern (null)
>> cluster.entry-change-log on
>> cluster.read-subvolume (null)
>> cluster.read-subvolume-index -1
>> cluster.read-hash-mode 1
>> cluster.background-self-heal-count 8
>> cluster.metadata-self-heal on
>> cluster.data-self-heal on
>> cluster.entry-self-heal on
>> cluster.self-heal-daemon on
>> cluster.heal-timeout 600
>> cluster.self-heal-window-size 1
>> cluster.data-change-log on
>> cluster.metadata-change-log on
>> cluster.data-self-heal-algorithm (null)
>> cluster.eager-lock on
>> disperse.eager-lock on
>> disperse.other-eager-lock on
>> disperse.eager-lock-timeout 1
>> disperse.other-eager-lock-timeout 1
>> cluster.quorum-type none
>> cluster.quorum-count (null)
>> cluster.choose-local true
>> cluster.self-heal-readdir-size 1KB
>> cluster.post-op-delay-secs 1
>> cluster.ensure-durability on
>> cluster.consistent-metadata no
>> cluster.heal-wait-queue-length 128
>> cluster.favorite-child-policy none
>> cluster.full-lock yes
>> cluster.stripe-block-size 128KB
>> cluster.stripe-coalesce true
>> diagnostics.latency-measurement off
>> diagnostics.dump-fd-stats off
>> diagnostics.count-fop-hits off
>> diagnostics.brick-log-level INFO
>> diagnostics.client-log-level INFO
>> diagnostics.brick-sys-log-level CRITICAL
>> diagnostics.client-sys-log-level CRITICAL
>> diagnostics.brick-logger (null)
>> diagnostics.client-logger (null)
>> diagnostics.brick-log-format (null)
>> diagnostics.client-log-format (null)
>> diagnostics.brick-log-buf-size 5
>> diagnostics.client-log-buf-size 5
>> diagnostics.brick-log-flush-timeout 120
>> diagnostics.client-log-flush-timeout 120
>> diagnostics.stats-dump-interval 0
>> diagnostics.fop-sample-interval 0
>> diagnostics.stats-dump-format json
>> diagnostics.fop-sample-buf-size 65535
>> diagnostics.stats-dnscache-ttl-sec 86400
>> performance.cache-max-file-size 0
>> performance.cache-min-file-size 0
>> performance.cache-refresh-timeout 1
>> performance.cache-priority
>> performance.cache-size 32MB
>> performance.io-thread-count 16
>> performance.high-prio-threads 16
>> performance.normal-prio-threads 16
>> performance.low-prio-threads 16
>> performance.least-prio-threads 1
>> performance.enable-least-priority on
>> performance.iot-watchdog-secs (null)
>> performance.iot-cleanup-disconnected-reqs off
>> performance.iot-pass-through false
>> performance.io-cache-pass-through false
>> performance.cache-size 128MB
>> performance.qr-cache-timeout 1
>> performance.cache-invalidation false
>> performance.ctime-invalidation false
>> performance.flush-behind on
>> performance.nfs.flush-behind on
>> performance.write-behind-window-size 1MB
>> performance.resync-failed-syncs-after-fsync off
>> performance.nfs.write-behind-window-size 1MB
>> performance.strict-o-direct off
>> performance.nfs.strict-o-direct off
>> performance.strict-write-ordering off
>> performance.nfs.strict-write-ordering off
>> performance.write-behind-trickling-writes on
>> performance.aggregate-size 128KB
>> performance.nfs.write-behind-trickling-writes on
>> performance.lazy-open yes
>> performance.read-after-open yes
>> performance.open-behind-pass-through false
>> performance.read-ahead-page-count 4
>> performance.read-ahead-pass-through false
>> performance.readdir-ahead-pass-through false
>> performance.md-cache-pass-through false
>> performance.md-cache-timeout 1
>> performance.cache-swift-metadata true
>> performance.cache-samba-metadata false
>> performance.cache-capability-xattrs true
>> performance.cache-ima-xattrs true
>> performance.md-cache-statfs off
>> performance.xattr-cache-list
>> performance.nl-cache-pass-through false
>> features.encryption off
>> encryption.master-key (null)
>> encryption.data-key-size 256
>> encryption.block-size 4096
>> network.frame-timeout 1800
>> network.ping-timeout 42
>> network.tcp-window-size (null)
>> network.remote-dio disable
>> client.event-threads 2
>> client.tcp-user-timeout 0
>> client.keepalive-time 20
>> client.keepalive-interval 2
>> client.keepalive-count 9
>> network.tcp-window-size (null)
>> network.inode-lru-limit 16384
>> auth.allow *
>> auth.reject (null)
>> transport.keepalive 1
>> server.allow-insecure on
>> server.root-squash off
>> server.anonuid 65534
>> server.anongid 65534
>> server.statedump-path /var/run/gluster
>> server.outstanding-rpc-limit 64
>> server.ssl (null)
>> auth.ssl-allow *
>> server.manage-gids off
>> server.dynamic-auth on
>> client.send-gids on
>> server.gid-timeout 300
>> server.own-thread (null)
>> server.event-threads 1
>> server.tcp-user-timeout 0
>> server.keepalive-time 20
>> server.keepalive-interval 2
>> server.keepalive-count 9
>> transport.listen-backlog 1024
>> ssl.own-cert (null)
>> ssl.private-key (null)
>> ssl.ca-list (null)
>> ssl.crl-path (null)
>> ssl.certificate-depth (null)
>> ssl.cipher-list (null)
>> ssl.dh-param (null)
>> ssl.ec-curve (null)
>> transport.address-family inet
>> performance.write-behind on
>> performance.read-ahead on
>> performance.readdir-ahead on
>> performance.io-cache on
>> performance.quick-read on
>> performance.open-behind on
>> performance.nl-cache off
>> performance.stat-prefetch on
>> performance.client-io-threads off
>> performance.nfs.write-behind on
>> performance.nfs.read-ahead off
>> performance.nfs.io-cache off
>> performance.nfs.quick-read off
>> performance.nfs.stat-prefetch off
>> performance.nfs.io-threads off
>> performance.force-readdirp true
>> performance.cache-invalidation false
>> features.uss off
>> features.snapshot-directory .snaps
>> features.show-snapshot-directory off
>> features.tag-namespaces off
>> network.compression off
>> network.compression.window-size -15
>> network.compression.mem-level 8
>> network.compression.min-size 0
>> network.compression.compression-level -1
>> network.compression.debug false
>> features.default-soft-limit 80%
>> features.soft-timeout 60
>> features.hard-timeout 5
>> features.alert-time 86400
>> features.quota-deem-statfs off
>> geo-replication.indexing off
>> geo-replication.indexing off
>> geo-replication.ignore-pid-check off
>> geo-replication.ignore-pid-check off
>> features.quota off
>> features.inode-quota off
>> features.bitrot disable
>> debug.trace off
>> debug.log-history no
>> debug.log-file no
>> debug.exclude-ops (null)
>> debug.include-ops (null)
>> debug.error-gen off
>> debug.error-failure (null)
>> debug.error-number (null)
>> debug.random-failure off
>> debug.error-fops (null)
>> nfs.disable on
>> features.read-only off
>> features.worm off
>> features.worm-file-level off
>> features.worm-files-deletable on
>> features.default-retention-period 120
>> features.retention-mode relax
>> features.auto-commit-period 180
>> storage.linux-aio off
>> storage.batch-fsync-mode reverse-fsync
>> storage.batch-fsync-delay-usec 0
>> storage.owner-uid -1
>> storage.owner-gid -1
>> storage.node-uuid-pathinfo off
>> storage.health-check-interval 30
>> storage.build-pgfid off
>> storage.gfid2path on
>> storage.gfid2path-separator :
>> storage.reserve 1
>> storage.health-check-timeout 10
>> storage.fips-mode-rchecksum off
>> storage.force-create-mode 0000
>> storage.force-directory-mode 0000
>> storage.create-mask 0777
>> storage.create-directory-mask 0777
>> storage.max-hardlinks 100
>> storage.ctime off
>> storage.bd-aio off
>> config.gfproxyd off
>> cluster.server-quorum-type off
>> cluster.server-quorum-ratio 0
>> changelog.changelog off
>> changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
>> changelog.encoding ascii
>> changelog.rollover-time 15
>> changelog.fsync-interval 5
>> changelog.changelog-barrier-timeout 120
>> changelog.capture-del-path off
>> features.barrier disable
>> features.barrier-timeout 120
>> features.trash off
>> features.trash-dir .trashcan
>> features.trash-eliminate-path (null)
>> features.trash-max-filesize 5MB
>> features.trash-internal-op off
>> cluster.enable-shared-storage disable
>> cluster.write-freq-threshold 0
>> cluster.read-freq-threshold 0
>> cluster.tier-pause off
>> cluster.tier-promote-frequency 120
>> cluster.tier-demote-frequency 3600
>> cluster.watermark-hi 90
>> cluster.watermark-low 75
>> cluster.tier-mode cache
>> cluster.tier-max-promote-file-size 0
>> cluster.tier-max-mb 4000
>> cluster.tier-max-files 10000
>> cluster.tier-query-limit 100
>> cluster.tier-compact on
>> cluster.tier-hot-compact-frequency 604800
>> cluster.tier-cold-compact-frequency 604800
>> features.ctr-enabled off
>> features.record-counters off
>> features.ctr-record-metadata-heat off
>> features.ctr_link_consistency off
>> features.ctr_lookupheal_link_timeout 300
>> features.ctr_lookupheal_inode_timeout 300
>> features.ctr-sql-db-cachesize 12500
>> features.ctr-sql-db-wal-autocheckpoint 25000
>> features.selinux on
>> locks.trace off
>> locks.mandatory-locking off
>> cluster.disperse-self-heal-daemon enable
>> cluster.quorum-reads no
>> client.bind-insecure (null)
>> features.shard off
>> features.shard-block-size 64MB
>> features.shard-lru-limit 16384
>> features.shard-deletion-rate 100
>> features.scrub-throttle lazy
>> features.scrub-freq biweekly
>> features.scrub false
>> features.expiry-time 120
>> features.cache-invalidation off
>> features.cache-invalidation-timeout 60
>> features.leases off
>> features.lease-lock-recall-timeout 60
>> disperse.background-heals 8
>> disperse.heal-wait-qlength 128
>> cluster.heal-timeout 600
>> dht.force-readdirp on
>> disperse.read-policy gfid-hash
>> cluster.shd-max-threads 1
>> cluster.shd-wait-qlength 1024
>> cluster.locking-scheme full
>> cluster.granular-entry-heal no
>> features.locks-revocation-secs 0
>> features.locks-revocation-clear-all false
>> features.locks-revocation-max-blocked 0
>> features.locks-monkey-unlocking false
>> features.locks-notify-contention no
>> features.locks-notify-contention-delay 5
>> disperse.shd-max-threads 1
>> disperse.shd-wait-qlength 1024
>> disperse.cpu-extensions auto
>> disperse.self-heal-window-size 1
>> cluster.use-compound-fops off
>> performance.parallel-readdir off
>> performance.rda-request-size 131072
>> performance.rda-low-wmark 4096
>> performance.rda-high-wmark 128KB
>> performance.rda-cache-limit 10MB
>> performance.nl-cache-positive-entry false
>> performance.nl-cache-limit 10MB
>> performance.nl-cache-timeout 60
>> cluster.brick-multiplex off
>> cluster.max-bricks-per-process 0
>> disperse.optimistic-change-log on
>> disperse.stripe-cache 4
>> cluster.halo-enabled False
>> cluster.halo-shd-max-latency 99999
>> cluster.halo-nfsd-max-latency 5
>> cluster.halo-max-latency 5
>> cluster.halo-max-replicas 99999
>> cluster.halo-min-replicas 2
>> cluster.daemon-log-level INFO
>> debug.delay-gen off
>> delay-gen.delay-percentage 10%
>> delay-gen.delay-duration 100000
>> delay-gen.enable
>> disperse.parallel-writes on
>> features.sdfs on
>> features.cloudsync off
>> features.utime off
>> ctime.noatime on
>> feature.cloudsync-storetype (null)
>>
>> Thanks again.
>>
>>
>> On Wed, 25 Dec 2019 at 05:51, Strahil <hunter86_bg at yahoo.com> wrote:
>>>
>>> Hi David,
>>>
>>> On Dec 24, 2019 02:47, David Cunningham <dcunningham at voisonics.com> wrote:
>>> >
>>> > Hello,
>>> >
>>> > In testing we found that actually the GFS client having access to all 3 nodes made no difference to performance. Perhaps that's because the 3rd node that wasn't accessible from the client before was the arbiter node?
>>> It makes sense, as no data is being generated towards the arbiter.
>>> > Presumably we shouldn't have an arbiter node listed under backupvolfile-server when mounting the filesystem? Since it doesn't store all the data surely it can't be used to serve the data.
>>>
>>> I have my arbiter defined as last backup and no issues so far. At least the admin can easily identify the bricks from the mount options.
>>>
>>> > We did have direct-io-mode=disable already as well, so that wasn't a factor in the performance problems.
>>>
>>> Have you checked whether the client version is too old?
>>> Also, you can check the cluster's operating version:
>>> # gluster volume get all cluster.max-op-version
>>> # gluster volume get all cluster.op-version
>>>
>>> Cluster's op version should be at max-op-version.
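The comparison between the two values can be scripted. A minimal sketch, assuming both commands print the two-column "Option Value" layout; option_value and op_version_gap are hypothetical helper names.

```shell
# option_value NAME: read `gluster volume get all NAME` output on stdin, print NAME's value.
option_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# op_version_gap CURRENT MAX: report whether the cluster op-version can still be raised.
op_version_gap() {
    if [ "$1" -lt "$2" ]; then
        echo "op-version $1 is below max-op-version $2"
    else
        echo "op-version $1 is already at the maximum"
    fi
}

# Usage:
#   cur=$(gluster volume get all cluster.op-version | option_value cluster.op-version)
#   max=$(gluster volume get all cluster.max-op-version | option_value cluster.max-op-version)
#   op_version_gap "$cur" "$max"
#   gluster volume set all cluster.op-version "$max"   # only once every node is upgraded
```

The final 'gluster volume set all cluster.op-version' line is left as a comment on purpose, since raising the op-version is not something to do automatically.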
>>>
>>> Two options come to mind:
>>> A) Upgrade to the latest Gluster v6 or even v7 (I know it won't be easy) and then set the op-version to the highest possible value.
>>> # gluster volume get all cluster.max-op-version
>>> # gluster volume get all cluster.op-version
>>>
>>> B) Deploy an NFS-Ganesha server and connect the client over NFS v4.2 (and control the parallel connections from Ganesha).
>>>
>>> Can you provide your Gluster volume's options?
>>> 'gluster volume get <VOLNAME> all'
>>>
>>> > Thanks again for any advice.
>>> >
>>> >
>>> >
>>> > On Mon, 23 Dec 2019 at 13:09, David Cunningham <dcunningham at voisonics.com> wrote:
>>> >>
>>> >> Hi Strahil,
>>> >>
>>> >> Thanks for that. We do have one backup server specified, but will add the second backup as well.
>>> >>
>>> >>
>>> >> On Sat, 21 Dec 2019 at 11:26, Strahil <hunter86_bg at yahoo.com> wrote:
>>> >>>
>>> >>> Hi David,
>>> >>>
>>> >>> Also consider using the mount option to specify backup servers via 'backupvolfile-server=server2:server3' (you can define more, but I don't think replica volumes greater than 3 are useful, except maybe in some special cases).
>>> >>>
>>> >>> That way, when the primary is lost, your client can reach a backup one without disruption.
>>> >>>
>>> >>> P.S.: The client may 'hang' if the primary server is rebooted ungracefully, as the communication must time out before FUSE addresses the next server. There is a special script for killing gluster processes in '/usr/share/glusterfs/scripts' which can be used to set up a systemd service that does this for you on shutdown.
>>> >>>
>>> >>> Best Regards,
>>> >>> Strahil Nikolov
>>> >>>
>>> >>> On Dec 20, 2019 23:49, David Cunningham <dcunningham at voisonics.com> wrote:
>>> >>>>
>>> >>>> Hi Strahil,
>>> >>>>
>>> >>>> Ah, that is an important point. One of the nodes is not accessible from the client, and we assumed that it only needed to reach the GFS node that was mounted so didn't think anything of it.
>>> >>>>
>>> >>>> We will try making all nodes accessible, as well as "direct-io-mode=disable".
>>> >>>>
>>> >>>> Thank you.
>>> >>>>
>>> >>>>
>>> >>>> On Sat, 21 Dec 2019 at 10:29, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
>>> >>>>>
>>> >>>>> Actually, I didn't explain myself clearly.
>>> >>>>> A FUSE mount on the client side connects directly to all the bricks that make up the volume.
>>> >>>>> If for some reason (bad routing, a firewall block) the client can reach only 2 out of 3 bricks, this can constantly trigger healing (as one of the bricks is never updated), which will degrade performance and cause excessive network usage.
>>> >>>>> As your attachment is from one of the gluster nodes, this could be the case.
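Whether the client can reach every brick can be checked from the client itself. The sketch below is only an illustration: tcp_probe and unreachable are hypothetical helper names, gfs3 is a hypothetical name for the third node, and 49152 is a placeholder port (the real brick ports are listed by 'gluster volume status'). It relies on bash's /dev/tcp and on the coreutils 'timeout' command.

```shell
# tcp_probe HOST PORT: succeed if a TCP connection opens within 3 seconds.
tcp_probe() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# unreachable PROBE ENDPOINT...: print each HOST:PORT that PROBE cannot reach.
# The body runs in a subshell so its variables do not leak into the caller.
unreachable() (
    probe=$1; shift
    for ep in "$@"; do
        "$probe" "${ep%%:*}" "${ep##*:}" || echo "$ep"
    done
)

# Usage, run on the client machine:
#   unreachable tcp_probe gfs1:49152 gfs2:49152 gfs3:49152
```

An empty result means every listed endpoint accepted a connection; any endpoint that is printed points at the routing or firewall problem described above.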
>>> >>>>>
>>> >>>>> Best Regards,
>>> >>>>> Strahil Nikolov
>>> >>>>>
>>> >>>>> On Friday, December 20, 2019, 01:49:56 GMT+2, David Cunningham <dcunningham at voisonics.com> wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>> Hi Strahil,
>>> >>>>>
>>> >>>>> The chart attached to my original email is taken from the GFS server.
>>> >>>>>
>>> >>>>> I'm not sure what you mean by accessing all bricks simultaneously. We've mounted it from the client like this:
>>> >>>>> gfs1:/gvol0 /mnt/glusterfs/ glusterfs defaults,direct-io-mode=disable,_netdev,backupvolfile-server=gfs2,fetch-attempts=10 0 0
>>> >>>>>
>>> >>>>> Should we do something different to access all bricks simultaneously?
>>> >>>>>
>>> >>>>> Thanks for your help!
>>> >>>>>
>>> >>>>>
>>> >>>>> On Fri, 20 Dec 2019 at 11:47, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
>>> >>>>>>
>>> >>>>>> I'm not sure whether you measured the traffic from the client side (tcpdump on a client machine) or from the server side.
>>> >>>>>>
>>> >>>>>> In both cases, please verify that the client accesses all bricks simultaneously, as failing to do so can cause unnecessary heals.
>>> >>>>>>
>>> >>>>>> Have you thought about upgrading to v6? There are some enhancements in v6 which could be beneficial.
>>> >>>>>>
>>> >>>>>> Yet, it is indeed strange that so much traffic is generated with FUSE.
>>> >>>>>>
>>> >>>>>> Another approach is to test with NFS-Ganesha, which supports pNFS and can natively speak with Gluster; this can bring you closer to the previous setup and also provide some extra performance.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Best Regards,
>>> >>>>>> Strahil Nikolov
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>
>>> >>
>>> >> --
>>> >> David Cunningham, Voisonics Limited
>>> >> http://voisonics.com/
>>> >> USA: +1 213 221 1092
>>> >> New Zealand: +64 (0)28 2558 3782
>>> >
>>> >
>>> >
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>
>>
>>