[Gluster-users] GFS performance under heavy traffic

David Cunningham dcunningham at voisonics.com
Fri Dec 27 01:23:51 UTC 2019


Oh, and I see that the op-version is slightly less than the max-op-version:

[root at gfs1 ~]# gluster volume get all cluster.max-op-version
Option                                  Value

------                                  -----

cluster.max-op-version                  50400


[root at gfs1 ~]# gluster volume get all cluster.op-version
Option                                  Value

------                                  -----

cluster.op-version                      50000
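

If we do stay on this version for now, I believe we could bring the
op-version up to match with something like the following (since the
max-op-version is 50400, all peers should already support it):

[root at gfs1 ~]# gluster volume set all cluster.op-version 50400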



On Fri, 27 Dec 2019 at 14:22, David Cunningham <dcunningham at voisonics.com>
wrote:

> Hi Strahil,
>
> Our volume options are as below. Thanks for the suggestion to upgrade to
> version 6 or 7. We could do that by simply removing the current
> installation and installing the new one (since it's not live right now). We
> might have to convince the customer that it's likely to succeed though, as
> at the moment I think they believe that GFS is not going to work for them.
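>
> For reference, a rough sketch of what that reinstall might look like on
> these servers (assuming CentOS/RHEL with the Storage SIG repositories;
> package and repository names will differ on other distributions):
>
> # systemctl stop glusterd
> # yum remove glusterfs-server
> # yum install centos-release-gluster6
> # yum install glusterfs-server
> # systemctl start glusterd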
>
> Option                                  Value
>
> ------                                  -----
>
> cluster.lookup-unhashed                 on
>
> cluster.lookup-optimize                 on
>
> cluster.min-free-disk                   10%
>
> cluster.min-free-inodes                 5%
>
> cluster.rebalance-stats                 off
>
> cluster.subvols-per-directory           (null)
>
> cluster.readdir-optimize                off
>
> cluster.rsync-hash-regex                (null)
>
> cluster.extra-hash-regex                (null)
>
> cluster.dht-xattr-name                  trusted.glusterfs.dht
>
> cluster.randomize-hash-range-by-gfid    off
>
> cluster.rebal-throttle                  normal
>
> cluster.lock-migration                  off
>
> cluster.force-migration                 off
>
> cluster.local-volume-name               (null)
>
> cluster.weighted-rebalance              on
>
> cluster.switch-pattern                  (null)
>
> cluster.entry-change-log                on
>
> cluster.read-subvolume                  (null)
>
> cluster.read-subvolume-index            -1
>
> cluster.read-hash-mode                  1
>
> cluster.background-self-heal-count      8
>
> cluster.metadata-self-heal              on
>
> cluster.data-self-heal                  on
>
> cluster.entry-self-heal                 on
>
> cluster.self-heal-daemon                on
>
> cluster.heal-timeout                    600
>
> cluster.self-heal-window-size           1
>
> cluster.data-change-log                 on
>
> cluster.metadata-change-log             on
>
> cluster.data-self-heal-algorithm        (null)
>
> cluster.eager-lock                      on
>
> disperse.eager-lock                     on
>
> disperse.other-eager-lock               on
>
> disperse.eager-lock-timeout             1
>
> disperse.other-eager-lock-timeout       1
>
> cluster.quorum-type                     none
>
> cluster.quorum-count                    (null)
>
> cluster.choose-local                    true
>
> cluster.self-heal-readdir-size          1KB
>
> cluster.post-op-delay-secs              1
>
> cluster.ensure-durability               on
>
> cluster.consistent-metadata             no
>
> cluster.heal-wait-queue-length          128
>
> cluster.favorite-child-policy           none
>
> cluster.full-lock                       yes
>
> cluster.stripe-block-size               128KB
>
> cluster.stripe-coalesce                 true
>
> diagnostics.latency-measurement         off
>
> diagnostics.dump-fd-stats               off
>
> diagnostics.count-fop-hits              off
>
> diagnostics.brick-log-level             INFO
>
> diagnostics.client-log-level            INFO
>
> diagnostics.brick-sys-log-level         CRITICAL
>
> diagnostics.client-sys-log-level        CRITICAL
>
> diagnostics.brick-logger                (null)
>
> diagnostics.client-logger               (null)
>
> diagnostics.brick-log-format            (null)
>
> diagnostics.client-log-format           (null)
>
> diagnostics.brick-log-buf-size          5
>
> diagnostics.client-log-buf-size         5
>
> diagnostics.brick-log-flush-timeout     120
>
> diagnostics.client-log-flush-timeout    120
>
> diagnostics.stats-dump-interval         0
>
> diagnostics.fop-sample-interval         0
>
> diagnostics.stats-dump-format           json
>
> diagnostics.fop-sample-buf-size         65535
>
> diagnostics.stats-dnscache-ttl-sec      86400
>
> performance.cache-max-file-size         0
>
> performance.cache-min-file-size         0
>
> performance.cache-refresh-timeout       1
>
> performance.cache-priority
>
> performance.cache-size                  32MB
>
> performance.io-thread-count             16
>
> performance.high-prio-threads           16
>
> performance.normal-prio-threads         16
>
> performance.low-prio-threads            16
>
> performance.least-prio-threads          1
>
> performance.enable-least-priority       on
>
> performance.iot-watchdog-secs           (null)
>
> performance.iot-cleanup-disconnected-reqs off
>
> performance.iot-pass-through            false
>
> performance.io-cache-pass-through       false
>
> performance.cache-size                  128MB
>
> performance.qr-cache-timeout            1
>
> performance.cache-invalidation          false
>
> performance.ctime-invalidation          false
>
> performance.flush-behind                on
>
> performance.nfs.flush-behind            on
>
> performance.write-behind-window-size    1MB
>
> performance.resync-failed-syncs-after-fsync off
>
> performance.nfs.write-behind-window-size 1MB
>
> performance.strict-o-direct             off
>
> performance.nfs.strict-o-direct         off
>
> performance.strict-write-ordering       off
>
> performance.nfs.strict-write-ordering   off
>
> performance.write-behind-trickling-writes on
>
> performance.aggregate-size              128KB
>
> performance.nfs.write-behind-trickling-writes on
>
> performance.lazy-open                   yes
>
> performance.read-after-open             yes
>
> performance.open-behind-pass-through    false
>
> performance.read-ahead-page-count       4
>
> performance.read-ahead-pass-through     false
>
> performance.readdir-ahead-pass-through  false
>
> performance.md-cache-pass-through       false
>
> performance.md-cache-timeout            1
>
> performance.cache-swift-metadata        true
>
> performance.cache-samba-metadata        false
>
> performance.cache-capability-xattrs     true
>
> performance.cache-ima-xattrs            true
>
> performance.md-cache-statfs             off
>
> performance.xattr-cache-list
>
> performance.nl-cache-pass-through       false
>
> features.encryption                     off
>
> encryption.master-key                   (null)
>
> encryption.data-key-size                256
>
> encryption.block-size                   4096
>
> network.frame-timeout                   1800
>
> network.ping-timeout                    42
>
> network.tcp-window-size                 (null)
>
> network.remote-dio                      disable
>
> client.event-threads                    2
>
> client.tcp-user-timeout                 0
>
> client.keepalive-time                   20
>
> client.keepalive-interval               2
>
> client.keepalive-count                  9
>
> network.tcp-window-size                 (null)
>
> network.inode-lru-limit                 16384
>
> auth.allow                              *
>
> auth.reject                             (null)
>
> transport.keepalive                     1
>
> server.allow-insecure                   on
>
> server.root-squash                      off
>
> server.anonuid                          65534
>
> server.anongid                          65534
>
> server.statedump-path                   /var/run/gluster
>
> server.outstanding-rpc-limit            64
>
> server.ssl                              (null)
>
> auth.ssl-allow                          *
>
> server.manage-gids                      off
>
> server.dynamic-auth                     on
>
> client.send-gids                        on
>
> server.gid-timeout                      300
>
> server.own-thread                       (null)
>
> server.event-threads                    1
>
> server.tcp-user-timeout                 0
>
> server.keepalive-time                   20
>
> server.keepalive-interval               2
>
> server.keepalive-count                  9
>
> transport.listen-backlog                1024
>
> ssl.own-cert                            (null)
>
> ssl.private-key                         (null)
>
> ssl.ca-list                             (null)
>
> ssl.crl-path                            (null)
>
> ssl.certificate-depth                   (null)
>
> ssl.cipher-list                         (null)
>
> ssl.dh-param                            (null)
>
> ssl.ec-curve                            (null)
>
> transport.address-family                inet
>
> performance.write-behind                on
>
> performance.read-ahead                  on
>
> performance.readdir-ahead               on
>
> performance.io-cache                    on
>
> performance.quick-read                  on
>
> performance.open-behind                 on
>
> performance.nl-cache                    off
>
> performance.stat-prefetch               on
>
> performance.client-io-threads           off
>
> performance.nfs.write-behind            on
>
> performance.nfs.read-ahead              off
>
> performance.nfs.io-cache                off
>
> performance.nfs.quick-read              off
>
> performance.nfs.stat-prefetch           off
>
> performance.nfs.io-threads              off
>
> performance.force-readdirp              true
>
> performance.cache-invalidation          false
>
> features.uss                            off
>
> features.snapshot-directory             .snaps
>
> features.show-snapshot-directory        off
>
> features.tag-namespaces                 off
>
> network.compression                     off
>
> network.compression.window-size         -15
>
> network.compression.mem-level           8
>
> network.compression.min-size            0
>
> network.compression.compression-level   -1
>
> network.compression.debug               false
>
> features.default-soft-limit             80%
>
> features.soft-timeout                   60
>
> features.hard-timeout                   5
>
> features.alert-time                     86400
>
> features.quota-deem-statfs              off
>
> geo-replication.indexing                off
>
> geo-replication.indexing                off
>
> geo-replication.ignore-pid-check        off
>
> geo-replication.ignore-pid-check        off
>
> features.quota                          off
>
> features.inode-quota                    off
>
> features.bitrot                         disable
>
> debug.trace                             off
>
> debug.log-history                       no
>
> debug.log-file                          no
>
> debug.exclude-ops                       (null)
>
> debug.include-ops                       (null)
>
> debug.error-gen                         off
>
> debug.error-failure                     (null)
>
> debug.error-number                      (null)
>
> debug.random-failure                    off
>
> debug.error-fops                        (null)
>
> nfs.disable                             on
>
> features.read-only                      off
>
> features.worm                           off
>
> features.worm-file-level                off
>
> features.worm-files-deletable           on
>
> features.default-retention-period       120
>
> features.retention-mode                 relax
>
> features.auto-commit-period             180
>
> storage.linux-aio                       off
>
> storage.batch-fsync-mode                reverse-fsync
>
> storage.batch-fsync-delay-usec          0
>
> storage.owner-uid                       -1
>
> storage.owner-gid                       -1
>
> storage.node-uuid-pathinfo              off
>
> storage.health-check-interval           30
>
> storage.build-pgfid                     off
>
> storage.gfid2path                       on
>
> storage.gfid2path-separator             :
>
> storage.reserve                         1
>
> storage.health-check-timeout            10
>
> storage.fips-mode-rchecksum             off
>
> storage.force-create-mode               0000
>
> storage.force-directory-mode            0000
>
> storage.create-mask                     0777
>
> storage.create-directory-mask           0777
>
> storage.max-hardlinks                   100
>
> storage.ctime                           off
>
> storage.bd-aio                          off
>
> config.gfproxyd                         off
>
> cluster.server-quorum-type              off
>
> cluster.server-quorum-ratio             0
>
> changelog.changelog                     off
>
> changelog.changelog-dir                 {{ brick.path }}/.glusterfs/changelogs
>
> changelog.encoding                      ascii
>
> changelog.rollover-time                 15
>
> changelog.fsync-interval                5
>
> changelog.changelog-barrier-timeout     120
>
> changelog.capture-del-path              off
>
> features.barrier                        disable
>
> features.barrier-timeout                120
>
> features.trash                          off
>
> features.trash-dir                      .trashcan
>
> features.trash-eliminate-path           (null)
>
> features.trash-max-filesize             5MB
>
> features.trash-internal-op              off
>
> cluster.enable-shared-storage           disable
>
> cluster.write-freq-threshold            0
>
> cluster.read-freq-threshold             0
>
> cluster.tier-pause                      off
>
> cluster.tier-promote-frequency          120
>
> cluster.tier-demote-frequency           3600
>
> cluster.watermark-hi                    90
>
> cluster.watermark-low                   75
>
> cluster.tier-mode                       cache
>
> cluster.tier-max-promote-file-size      0
>
> cluster.tier-max-mb                     4000
>
> cluster.tier-max-files                  10000
>
> cluster.tier-query-limit                100
>
> cluster.tier-compact                    on
>
> cluster.tier-hot-compact-frequency      604800
>
> cluster.tier-cold-compact-frequency     604800
>
> features.ctr-enabled                    off
>
> features.record-counters                off
>
> features.ctr-record-metadata-heat       off
>
> features.ctr_link_consistency           off
>
> features.ctr_lookupheal_link_timeout    300
>
> features.ctr_lookupheal_inode_timeout   300
>
> features.ctr-sql-db-cachesize           12500
>
> features.ctr-sql-db-wal-autocheckpoint  25000
>
> features.selinux                        on
>
> locks.trace                             off
>
> locks.mandatory-locking                 off
>
> cluster.disperse-self-heal-daemon       enable
>
> cluster.quorum-reads                    no
>
> client.bind-insecure                    (null)
>
> features.shard                          off
>
> features.shard-block-size               64MB
>
> features.shard-lru-limit                16384
>
> features.shard-deletion-rate            100
>
> features.scrub-throttle                 lazy
>
> features.scrub-freq                     biweekly
>
> features.scrub                          false
>
> features.expiry-time                    120
>
> features.cache-invalidation             off
>
> features.cache-invalidation-timeout     60
>
> features.leases                         off
>
> features.lease-lock-recall-timeout      60
>
> disperse.background-heals               8
>
> disperse.heal-wait-qlength              128
>
> cluster.heal-timeout                    600
>
> dht.force-readdirp                      on
>
> disperse.read-policy                    gfid-hash
>
> cluster.shd-max-threads                 1
>
> cluster.shd-wait-qlength                1024
>
> cluster.locking-scheme                  full
>
> cluster.granular-entry-heal             no
>
> features.locks-revocation-secs          0
>
> features.locks-revocation-clear-all     false
>
> features.locks-revocation-max-blocked   0
>
> features.locks-monkey-unlocking         false
>
> features.locks-notify-contention        no
>
> features.locks-notify-contention-delay  5
>
> disperse.shd-max-threads                1
>
> disperse.shd-wait-qlength               1024
>
> disperse.cpu-extensions                 auto
>
> disperse.self-heal-window-size          1
>
> cluster.use-compound-fops               off
>
> performance.parallel-readdir            off
>
> performance.rda-request-size            131072
>
> performance.rda-low-wmark               4096
>
> performance.rda-high-wmark              128KB
>
> performance.rda-cache-limit             10MB
>
> performance.nl-cache-positive-entry     false
>
> performance.nl-cache-limit              10MB
>
> performance.nl-cache-timeout            60
>
> cluster.brick-multiplex                 off
>
> cluster.max-bricks-per-process          0
>
> disperse.optimistic-change-log          on
>
> disperse.stripe-cache                   4
>
> cluster.halo-enabled                    False
>
> cluster.halo-shd-max-latency            99999
>
> cluster.halo-nfsd-max-latency           5
>
> cluster.halo-max-latency                5
>
> cluster.halo-max-replicas               99999
>
> cluster.halo-min-replicas               2
>
> cluster.daemon-log-level                INFO
>
> debug.delay-gen                         off
>
> delay-gen.delay-percentage              10%
>
> delay-gen.delay-duration                100000
>
> delay-gen.enable
>
> disperse.parallel-writes                on
>
> features.sdfs                           on
>
> features.cloudsync                      off
>
> features.utime                          off
>
> ctime.noatime                           on
>
> feature.cloudsync-storetype             (null)
>
>
> Thanks again.
>
>
> On Wed, 25 Dec 2019 at 05:51, Strahil <hunter86_bg at yahoo.com> wrote:
>
>> Hi David,
>>
>> On Dec 24, 2019 02:47, David Cunningham <dcunningham at voisonics.com>
>> wrote:
>> >
>> > Hello,
>> >
>> > In testing we found that actually the GFS client having access to all 3
>> nodes made no difference to performance. Perhaps that's because the 3rd
>> node that wasn't accessible from the client before was the arbiter node?
>> It makes sense, as no data is being generated towards the arbiter.
>> > Presumably we shouldn't have an arbiter node listed under
>> backupvolfile-server when mounting the filesystem? Since it doesn't store
>> all the data surely it can't be used to serve the data.
>>
>> I have my arbiter defined as the last backup and have had no issues so far.
>> At least the admin can easily identify the bricks from the mount options.
>>
>> > We did have direct-io-mode=disable already as well, so that wasn't a
>> factor in the performance problems.
>>
>> Have you checked whether the client version is not too old?
>> You can also check the cluster's operation version:
>> # gluster volume get all cluster.max-op-version
>> # gluster volume get all cluster.op-version
>>
>> Cluster's op version should be at max-op-version.
>>
>> Two options come to mind:
>> A) Upgrade to the latest Gluster v6 or even v7 (I know it won't be easy) and
>> then set the op-version to the highest possible.
>> # gluster volume get all cluster.max-op-version
>> # gluster volume get all cluster.op-version
>>
>> B) Deploy an NFS-Ganesha server and connect the client over NFS v4.2 (and
>> control the parallel connections from Ganesha).
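>>
>> For example, once Ganesha exports the volume, the client mount could look
>> roughly like this (hostname and mount point here are just placeholders):
>> # mount -t nfs -o vers=4.2 ganesha-host:/gvol0 /mnt/glusterfs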
>>
>> Can you provide your Gluster volume's options?
>> 'gluster volume get <VOLNAME> all'
>>
>> > Thanks again for any advice.
>> >
>> >
>> >
>> > On Mon, 23 Dec 2019 at 13:09, David Cunningham <
>> dcunningham at voisonics.com> wrote:
>> >>
>> >> Hi Strahil,
>> >>
>> >> Thanks for that. We do have one backup server specified, but will add
>> the second backup as well.
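>> >>
>> >> Something like this in /etc/fstab, if I understand correctly (assuming
>> >> the third node's hostname is gfs3):
>> >> gfs1:/gvol0 /mnt/glusterfs/ glusterfs defaults,direct-io-mode=disable,_netdev,backupvolfile-server=gfs2:gfs3,fetch-attempts=10 0 0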
>> >>
>> >>
>> >> On Sat, 21 Dec 2019 at 11:26, Strahil <hunter86_bg at yahoo.com> wrote:
>> >>>
>> >>> Hi David,
>> >>>
>> >>> Also consider using the mount option to specify backup servers via
>> 'backupvolfile-server=server2:server3' (you can define more, but I don't
>> think replica volumes greater than 3 are useful, except maybe in some
>> special cases).
>> >>>
>> >>> That way, when the primary is lost, your client can reach a backup
>> one without disruption.
>> >>>
>> >>> P.S.: The client may 'hang' if the primary server got rebooted
>> ungracefully, as the communication must time out before FUSE addresses the
>> next server. There is a special script for killing gluster processes in
>> '/usr/share/gluster/scripts' which can be used to set up a systemd
>> service that does that for you on shutdown.
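>> >>>
>> >>> For example, something like this could set it up (a rough sketch; the
>> >>> unit name and the exact script name are assumptions, so check what your
>> >>> package actually ships before using it):
>> >>>
>> >>> # cat > /etc/systemd/system/gluster-kill-at-shutdown.service <<'EOF'
>> >>> [Unit]
>> >>> Description=Stop all Gluster processes cleanly at shutdown
>> >>> After=glusterd.service
>> >>>
>> >>> [Service]
>> >>> Type=oneshot
>> >>> RemainAfterExit=yes
>> >>> ExecStart=/bin/true
>> >>> # Script path/name may differ between versions - check /usr/share/gluster*/scripts
>> >>> ExecStop=/usr/share/gluster/scripts/stop-all-gluster-processes.sh
>> >>>
>> >>> [Install]
>> >>> WantedBy=multi-user.target
>> >>> EOF
>> >>> # systemctl daemon-reload
>> >>> # systemctl enable --now gluster-kill-at-shutdown.service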
>> >>>
>> >>> Best Regards,
>> >>> Strahil Nikolov
>> >>>
>> >>> On Dec 20, 2019 23:49, David Cunningham <dcunningham at voisonics.com>
>> wrote:
>> >>>>
>> >>>> Hi Strahil,
>> >>>>
>> >>>> Ah, that is an important point. One of the nodes is not accessible
>> from the client, and we assumed that it only needed to reach the GFS node
>> that was mounted, so we didn't think anything of it.
>> >>>>
>> >>>> We will try making all nodes accessible, as well as
>> "direct-io-mode=disable".
>> >>>>
>> >>>> Thank you.
>> >>>>
>> >>>>
>> >>>> On Sat, 21 Dec 2019 at 10:29, Strahil Nikolov <hunter86_bg at yahoo.com>
>> wrote:
>> >>>>>
>> >>>>> Actually, I haven't made myself clear.
>> >>>>> The FUSE mount on the client side connects directly to all bricks
>> that make up the volume.
>> >>>>> If for some reason (bad routing, a firewall block) the client can
>> reach only 2 out of 3 bricks, this can constantly cause healing to happen
>> (as one of the bricks is never updated), which will degrade performance
>> and cause excessive network usage.
>> >>>>> As your attachment is from one of the gluster nodes, this could be
>> the case.
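>> >>>>>
>> >>>>> You can check whether heals keep piling up with something like:
>> >>>>> # gluster volume heal <VOLNAME> info
>> >>>>> # gluster volume heal <VOLNAME> statistics heal-count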
>> >>>>>
>> >>>>> Best Regards,
>> >>>>> Strahil Nikolov
>> >>>>>
>> >>>>> On Friday, 20 December 2019, 01:49:56 GMT+2, David
>> Cunningham <dcunningham at voisonics.com> wrote:
>> >>>>>
>> >>>>>
>> >>>>> Hi Strahil,
>> >>>>>
>> >>>>> The chart attached to my original email is taken from the GFS
>> server.
>> >>>>>
>> >>>>> I'm not sure what you mean by accessing all bricks simultaneously.
>> We've mounted it from the client like this:
>> >>>>> gfs1:/gvol0 /mnt/glusterfs/ glusterfs
>> defaults,direct-io-mode=disable,_netdev,backupvolfile-server=gfs2,fetch-attempts=10
>> 0 0
>> >>>>>
>> >>>>> Should we do something different to access all bricks
>> simultaneously?
>> >>>>>
>> >>>>> Thanks for your help!
>> >>>>>
>> >>>>>
>> >>>>> On Fri, 20 Dec 2019 at 11:47, Strahil Nikolov <
>> hunter86_bg at yahoo.com> wrote:
>> >>>>>>
>> >>>>>> I'm not sure whether you measured the traffic from the client side
>> (tcpdump on a client machine) or from the server side.
>> >>>>>>
>> >>>>>> In both cases , please verify that the client accesses all bricks
>> simultaneously, as this can cause unnecessary heals.
>> >>>>>>
>> >>>>>> Have you thought about upgrading to v6? There are some
>> enhancements in v6 which could be beneficial.
>> >>>>>>
>> >>>>>> Yet, it is indeed strange that so much traffic is generated with
>> FUSE.
>> >>>>>>
>> >>>>>> Another approach is to test with NFS-Ganesha, which supports pNFS
>> and can natively speak with Gluster; that can bring you closer to the
>> previous setup and also provide some extra performance.
>> >>>>>>
>> >>>>>>
>> >>>>>> Best Regards,
>> >>>>>> Strahil Nikolov
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>
>> >>
>> >> --
>> >> David Cunningham, Voisonics Limited
>> >> http://voisonics.com/
>> >> USA: +1 213 221 1092
>> >> New Zealand: +64 (0)28 2558 3782
>> >
>> >
>> >
>> > --
>> > David Cunningham, Voisonics Limited
>> > http://voisonics.com/
>> > USA: +1 213 221 1092
>> > New Zealand: +64 (0)28 2558 3782
>>
>> Best Regards,
>> Strahil Nikolov
>>
>
>
> --
> David Cunningham, Voisonics Limited
> http://voisonics.com/
> USA: +1 213 221 1092
> New Zealand: +64 (0)28 2558 3782
>


-- 
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782

