[Gluster-users] upgrade best practices
Soumya Koduri
skoduri at redhat.com
Mon Apr 1 18:37:18 UTC 2019
Thanks for the details. Response inline -
On 4/1/19 9:45 PM, Jim Kinney wrote:
> On Sun, 2019-03-31 at 23:01 +0530, Soumya Koduri wrote:
>>
>> On 3/29/19 10:39 PM, Poornima Gurusiddaiah wrote:
>>>
>>>
>>> On Fri, Mar 29, 2019, 10:03 PM Jim Kinney <jim.kinney at gmail.com> wrote:
>>>
>>> Currently running 3.12 on Centos 7.6. Doing cleanups on split-brain
>>> and out-of-sync files that need healing.
>>>
>>> We need to migrate the three replica servers to gluster v. 5 or 6.
>>> We will also need to upgrade about 80 clients. Given that a complete
>>> removal of gluster will not touch the 200+TB of data on 12 volumes,
>>> we are looking at doing that process: stop all clients, stop all
>>> glusterd services, remove all of it, install the new version, set up
>>> new volumes from the old bricks, install new clients, and mount
>>> everything.
>>>
>>> We would like to get better performance from nfs-ganesha mounts,
>>> but that doesn't look like an option (we haven't done any parameter
>>> tweaks in testing yet). At a bare minimum, we would like to minimize
>>> the total downtime of all systems.
>>
>> Could you please be more specific here? As in, are you looking for better
>> performance during the upgrade process or in general? Compared to 3.12,
>> there are a lot of perf improvements in both the glusterfs and esp. the
>> nfs-ganesha (latest stable - V2.7.x) stack. If you could provide more
>> information about your workloads (e.g., large-file, small-file,
>> metadata-intensive), we can make some recommendations wrt configuration.
>
> Sure. More details:
>
> We are (soon to be) running a three-node, replica-only gluster service (2
> nodes now; the third is racked and ready to be added to the gluster
> cluster and synced). Each node has 2 external drive arrays plus one
> internal. Each node has 40G IB plus 40G IP connections (with plans to
> upgrade to 100G). We currently have 9 volumes, each 7TB up to 50TB in
> size. Each volume is a mix of thousands of large files (>1GB) and tens of
> thousands of small ones (~100KB), plus thousands in between.
>
> Currently we have a 13-node computational cluster with varying GPU
> abilities that mounts all of these volumes using gluster-fuse. Writes
> are slow, and reads perform as if from a single server. I have data from
> a test setup (nowhere near the capacity of the production system -
> just for testing commands and recoveries) that indicates raw NFS without
> gluster is much faster, while gluster-fuse is much slower. We have mmap
> issues with python and fuse-mounted locations; converting to NFS solves
> this. We have tinkered with kernel settings to handle the oom-killer so
> it will no longer kill glusterfs when an errant job eats all the RAM (we
> set oom_score_adj to -1000 for all glusterfs pids).
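For reference, the oom_score_adj change you describe is usually applied with
something along these lines (a rough sketch only; it has to be re-applied
whenever the gluster processes are restarted):

    # protect the gluster processes from the OOM killer
    for pid in $(pgrep -x glusterfsd; pgrep -x glusterfs); do
        echo -1000 > /proc/$pid/oom_score_adj
    done
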
Have you tried tuning any perf parameters? From the volume options you
have shared below, I see that there is scope to improve performance (e.g.,
by enabling the md-cache parameters and parallel-readdir, the latency of
metadata-related operations can be improved). I request Poornima, Xavi or
Du to comment on recommended values for better I/O throughput for your
workload.
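For example, a common starting point looks something like the below (the
values are only illustrative; please treat this as a sketch until the perf
folks confirm what suits your workload):

    gluster volume set home features.cache-invalidation on
    gluster volume set home features.cache-invalidation-timeout 600
    gluster volume set home performance.stat-prefetch on
    gluster volume set home performance.cache-invalidation on
    gluster volume set home performance.md-cache-timeout 600
    gluster volume set home network.inode-lru-limit 200000
    # parallel-readdir needs readdir-ahead enabled
    gluster volume set home performance.readdir-ahead on
    gluster volume set home performance.parallel-readdir on
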
>
> We would like to transition (smoothly!!) to gluster 5 or 6 with
> nfs-ganesha 2.7 and see some performance improvements. We will be using
> corosync and pacemaker for NFS failover. It would be fantastic to be able
> to saturate a 10G IPoIB (or 40G IB!) connection to each compute node in
> the current computational cluster. Right now we absolutely can't get
> much write speed (copying a 6.2GB file from host to gluster storage took
> 1m 21s; cp from disk to /dev/null takes 7s). cp from gluster to /dev/null
> takes 1.0m (same 6.2GB file). That's a 10Gbps IPoIB connection running at
> only about 800Mbps.
A few things to note here -
* The volume option "nfs.disable" refers to the GlusterNFS service, which
is being deprecated and is not enabled by default in the latest gluster
versions (such as gluster 5 & 6). We recommend NFS-Ganesha, and hence this
option needs to be turned on (to disable GlusterNFS); an example command
is included below.
* Starting from Gluster 3.11, the HA configuration bits for NFS-Ganesha
have been removed from the gluster codebase. So you would need to either
manually configure an HA service on top of the NFS-Ganesha servers or use
storhaug [1] to configure the same; a minimal pacemaker sketch is included
below.
* Coming to the technical aspects: by switching to NFS, you could benefit
from the heavy caching done by the NFS client and a few other optimizations
it does. The NFS-Ganesha server also does metadata caching and resides on
the same nodes as the glusterfs servers. Apart from that, NFS-Ganesha acts
like any other glusterfs client (but by making use of libgfapi rather than
a fuse mount). It would be interesting to check if and how much improvement
you get with NFS compared to the fuse protocol for your workload; a sample
export block is included below. Please let us know when you have the test
environment ready. We will make recommendations wrt a few settings for the
NFS-Ganesha server and client.
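For the first point, the command is simply (repeat per volume, 'home' here
being one of yours):

    gluster volume set home nfs.disable on
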
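For the HA part, since you already plan to use corosync/pacemaker, a minimal
sketch of a floating IP for the NFS-Ganesha endpoint could look like the
below (the IP, netmask and resource names are placeholders; storhaug
automates a fuller setup including failover of the NFS grace period):

    # manage the ganesha service and a floating IP with pacemaker
    pcs resource create ganesha systemd:nfs-ganesha clone
    pcs resource create nfs_vip ocf:heartbeat:IPaddr2 ip=10.0.0.100 \
        cidr_netmask=24 op monitor interval=10s
    # keep the VIP on a node where ganesha is running
    pcs constraint colocation add nfs_vip with ganesha-clone INFINITY
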
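And for reference, a typical FSAL_GLUSTER export block in ganesha.conf looks
roughly like this (export id, paths, hostname and volume name are
placeholders to adapt):

    EXPORT {
        Export_Id = 1;
        Path = "/home";
        Pseudo = "/home";
        Access_Type = RW;
        Squash = No_root_squash;
        Protocols = 3,4;
        FSAL {
            Name = GLUSTER;
            Hostname = "localhost";   # gluster server to connect to via libgfapi
            Volume = "home";
        }
    }

On the clients, a plain NFS mount of the VIP (e.g. mount -t nfs -o vers=4.1
<vip>:/home /home) is enough to start comparing against the fuse mounts.
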
Thanks,
Soumya
[1] https://github.com/linux-ha-storage/storhaug
>
> We would like to do things like enable SSL encryption of all data flows
> (we deal with PHI data in a HIPAA-regulated setting) but are concerned
> about performance. We are running dual Intel Xeon E5-2630L (12 physical
> cores each @ 2.4GHz) and 128GB RAM in each server node. We have 170
> users. About 20 are active at any time.
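In case it is useful when you evaluate the SSL/TLS part: gluster TLS is
typically enabled per volume along these lines (a sketch only; certificate
management and the performance impact for your PHI workload would need
testing first):

    # on every server and client, place the cert, key and CA bundle at:
    #   /etc/ssl/glusterfs.pem  /etc/ssl/glusterfs.key  /etc/ssl/glusterfs.ca
    # encrypt the management path as well
    touch /var/lib/glusterd/secure-access
    # enable TLS on the I/O path for a volume
    gluster volume set home client.ssl on
    gluster volume set home server.ssl on
    gluster volume set home auth.ssl-allow '*'   # or a list of certificate CNs
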
>
> The current settings on /home (the others are similar if not identical;
> nfs.disable may be true for the others):
>
> gluster volume get home all
> Option Value
> ------ -----
> cluster.lookup-unhashed on
> cluster.lookup-optimize off
> cluster.min-free-disk 10%
> cluster.min-free-inodes 5%
> cluster.rebalance-stats off
> cluster.subvols-per-directory (null)
> cluster.readdir-optimize off
> cluster.rsync-hash-regex (null)
> cluster.extra-hash-regex (null)
> cluster.dht-xattr-name trusted.glusterfs.dht
> cluster.randomize-hash-range-by-gfid off
> cluster.rebal-throttle normal
> cluster.lock-migration off
> cluster.local-volume-name (null)
> cluster.weighted-rebalance on
> cluster.switch-pattern (null)
> cluster.entry-change-log on
> cluster.read-subvolume (null)
> cluster.read-subvolume-index -1
> cluster.read-hash-mode 1
> cluster.background-self-heal-count 8
> cluster.metadata-self-heal on
> cluster.data-self-heal on
> cluster.entry-self-heal on
> cluster.self-heal-daemon enable
> cluster.heal-timeout 600
> cluster.self-heal-window-size 1
> cluster.data-change-log on
> cluster.metadata-change-log on
> cluster.data-self-heal-algorithm (null)
> cluster.eager-lock on
> disperse.eager-lock on
> cluster.quorum-type none
> cluster.quorum-count (null)
> cluster.choose-local true
> cluster.self-heal-readdir-size 1KB
> cluster.post-op-delay-secs 1
> cluster.ensure-durability on
> cluster.consistent-metadata no
> cluster.heal-wait-queue-length 128
> cluster.favorite-child-policy none
> cluster.stripe-block-size 128KB
> cluster.stripe-coalesce true
> diagnostics.latency-measurement off
> diagnostics.dump-fd-stats off
> diagnostics.count-fop-hits off
> diagnostics.brick-log-level INFO
> diagnostics.client-log-level INFO
> diagnostics.brick-sys-log-level CRITICAL
> diagnostics.client-sys-log-level CRITICAL
> diagnostics.brick-logger (null)
> diagnostics.client-logger (null)
> diagnostics.brick-log-format (null)
> diagnostics.client-log-format (null)
> diagnostics.brick-log-buf-size 5
> diagnostics.client-log-buf-size 5
> diagnostics.brick-log-flush-timeout 120
> diagnostics.client-log-flush-timeout 120
> diagnostics.stats-dump-interval 0
> diagnostics.fop-sample-interval 0
> diagnostics.stats-dump-format json
> diagnostics.fop-sample-buf-size 65535
> diagnostics.stats-dnscache-ttl-sec 86400
> performance.cache-max-file-size 0
> performance.cache-min-file-size 0
> performance.cache-refresh-timeout 1
> performance.cache-priority
> performance.cache-size 32MB
> performance.io-thread-count 16
> performance.high-prio-threads 16
> performance.normal-prio-threads 16
> performance.low-prio-threads 16
> performance.least-prio-threads 1
> performance.enable-least-priority on
> performance.cache-size 128MB
> performance.flush-behind on
> performance.nfs.flush-behind on
> performance.write-behind-window-size 1MB
> performance.resync-failed-syncs-after-fsync off
> performance.nfs.write-behind-window-size 1MB
> performance.strict-o-direct off
> performance.nfs.strict-o-direct off
> performance.strict-write-ordering off
> performance.nfs.strict-write-ordering off
> performance.lazy-open yes
> performance.read-after-open no
> performance.read-ahead-page-count 4
> performance.md-cache-timeout 1
> performance.cache-swift-metadata true
> performance.cache-samba-metadata false
> performance.cache-capability-xattrs true
> performance.cache-ima-xattrs true
> features.encryption off
> encryption.master-key (null)
> encryption.data-key-size 256
> encryption.block-size 4096
> network.frame-timeout 1800
> network.ping-timeout 42
> network.tcp-window-size (null)
> features.lock-heal off
> features.grace-timeout 10
> network.remote-dio disable
> client.event-threads 2
> client.tcp-user-timeout 0
> client.keepalive-time 20
> client.keepalive-interval 2
> client.keepalive-count 9
> network.tcp-window-size (null)
> network.inode-lru-limit 16384
> auth.allow *
> auth.reject (null)
> transport.keepalive 1
> server.allow-insecure (null)
> server.root-squash off
> server.anonuid 65534
> server.anongid 65534
> server.statedump-path /var/run/gluster
> server.outstanding-rpc-limit 64
> features.lock-heal off
> features.grace-timeout 10
> server.ssl (null)
> auth.ssl-allow *
> server.manage-gids off
> server.dynamic-auth on
> client.send-gids on
> server.gid-timeout 300
> server.own-thread (null)
> server.event-threads 1
> server.tcp-user-timeout 0
> server.keepalive-time 20
> server.keepalive-interval 2
> server.keepalive-count 9
> transport.listen-backlog 10
> ssl.own-cert (null)
> ssl.private-key (null)
> ssl.ca-list (null)
> ssl.crl-path (null)
> ssl.certificate-depth (null)
> ssl.cipher-list (null)
> ssl.dh-param (null)
> ssl.ec-curve (null)
> performance.write-behind on
> performance.read-ahead on
> performance.readdir-ahead off
> performance.io-cache on
> performance.quick-read on
> performance.open-behind on
> performance.nl-cache off
> performance.stat-prefetch on
> performance.client-io-threads off
> performance.nfs.write-behind on
> performance.nfs.read-ahead off
> performance.nfs.io-cache off
> performance.nfs.quick-read off
> performance.nfs.stat-prefetch off
> performance.nfs.io-threads off
> performance.force-readdirp true
> performance.cache-invalidation false
> features.uss off
> features.snapshot-directory .snaps
> features.show-snapshot-directory off
> network.compression off
> network.compression.window-size -15
> network.compression.mem-level 8
> network.compression.min-size 0
> network.compression.compression-level -1
> network.compression.debug false
> features.limit-usage (null)
> features.default-soft-limit 80%
> features.soft-timeout 60
> features.hard-timeout 5
> features.alert-time 86400
> features.quota-deem-statfs off
> geo-replication.indexing off
> geo-replication.indexing off
> geo-replication.ignore-pid-check off
> geo-replication.ignore-pid-check off
> features.quota off
> features.inode-quota off
> features.bitrot disable
> debug.trace off
> debug.log-history no
> debug.log-file no
> debug.exclude-ops (null)
> debug.include-ops (null)
> debug.error-gen off
> debug.error-failure (null)
> debug.error-number (null)
> debug.random-failure off
> debug.error-fops (null)
> nfs.enable-ino32 no
> nfs.mem-factor 15
> nfs.export-dirs on
> nfs.export-volumes on
> nfs.addr-namelookup off
> nfs.dynamic-volumes off
> nfs.register-with-portmap on
> nfs.outstanding-rpc-limit 16
> nfs.port 2049
> nfs.rpc-auth-unix on
> nfs.rpc-auth-null on
> nfs.rpc-auth-allow all
> nfs.rpc-auth-reject none
> nfs.ports-insecure off
> nfs.trusted-sync off
> nfs.trusted-write off
> nfs.volume-access read-write
> nfs.export-dir
> nfs.disable off
> nfs.nlm on
> nfs.acl on
> nfs.mount-udp off
> nfs.mount-rmtab /var/lib/glusterd/nfs/rmtab
> nfs.rpc-statd /sbin/rpc.statd
> nfs.server-aux-gids off
> nfs.drc off
> nfs.drc-size 0x20000
> nfs.read-size (1 * 1048576ULL)
> nfs.write-size (1 * 1048576ULL)
> nfs.readdir-size (1 * 1048576ULL)
> nfs.rdirplus on
> nfs.exports-auth-enable (null)
> nfs.auth-refresh-interval-sec (null)
> nfs.auth-cache-ttl-sec (null)
> features.read-only off
> features.worm off
> features.worm-file-level off
> features.default-retention-period 120
> features.retention-mode relax
> features.auto-commit-period 180
> storage.linux-aio off
> storage.batch-fsync-mode reverse-fsync
> storage.batch-fsync-delay-usec 0
> storage.owner-uid -1
> storage.owner-gid -1
> storage.node-uuid-pathinfo off
> storage.health-check-interval 30
> storage.build-pgfid on
> storage.gfid2path on
> storage.gfid2path-separator :
> storage.bd-aio off
> cluster.server-quorum-type off
> cluster.server-quorum-ratio 0
> changelog.changelog off
> changelog.changelog-dir (null)
> changelog.encoding ascii
> changelog.rollover-time 15
> changelog.fsync-interval 5
> changelog.changelog-barrier-timeout 120
> changelog.capture-del-path off
> features.barrier disable
> features.barrier-timeout 120
> features.trash off
> features.trash-dir .trashcan
> features.trash-eliminate-path (null)
> features.trash-max-filesize 5MB
> features.trash-internal-op off
> cluster.enable-shared-storage disable
> cluster.write-freq-threshold 0
> cluster.read-freq-threshold 0
> cluster.tier-pause off
> cluster.tier-promote-frequency 120
> cluster.tier-demote-frequency 3600
> cluster.watermark-hi 90
> cluster.watermark-low 75
> cluster.tier-mode cache
> cluster.tier-max-promote-file-size 0
> cluster.tier-max-mb 4000
> cluster.tier-max-files 10000
> cluster.tier-query-limit 100
> cluster.tier-compact on
> cluster.tier-hot-compact-frequency 604800
> cluster.tier-cold-compact-frequency 604800
> features.ctr-enabled off
> features.record-counters off
> features.ctr-record-metadata-heat off
> features.ctr_link_consistency off
> features.ctr_lookupheal_link_timeout 300
> features.ctr_lookupheal_inode_timeout 300
> features.ctr-sql-db-cachesize 12500
> features.ctr-sql-db-wal-autocheckpoint 25000
> features.selinux on
> locks.trace off
> locks.mandatory-locking off
> cluster.disperse-self-heal-daemon enable
> cluster.quorum-reads no
> client.bind-insecure (null)
> features.shard off
> features.shard-block-size 64MB
> features.scrub-throttle lazy
> features.scrub-freq biweekly
> features.scrub false
> features.expiry-time 120
> features.cache-invalidation off
> features.cache-invalidation-timeout 60
> features.leases off
> features.lease-lock-recall-timeout 60
> disperse.background-heals 8
> disperse.heal-wait-qlength 128
> cluster.heal-timeout 600
> dht.force-readdirp on
> disperse.read-policy gfid-hash
> cluster.shd-max-threads 1
> cluster.shd-wait-qlength 1024
> cluster.locking-scheme full
> cluster.granular-entry-heal no
> features.locks-revocation-secs 0
> features.locks-revocation-clear-all false
> features.locks-revocation-max-blocked 0
> features.locks-monkey-unlocking false
> disperse.shd-max-threads 1
> disperse.shd-wait-qlength 1024
> disperse.cpu-extensions auto
> disperse.self-heal-window-size 1
> cluster.use-compound-fops off
> performance.parallel-readdir off
> performance.rda-request-size 131072
> performance.rda-low-wmark 4096
> performance.rda-high-wmark 128KB
> performance.rda-cache-limit 10MB
> performance.nl-cache-positive-entry false
> performance.nl-cache-limit 10MB
> performance.nl-cache-timeout 60
> cluster.brick-multiplex off
> cluster.max-bricks-per-process 0
> disperse.optimistic-change-log on
> cluster.halo-enabled False
> cluster.halo-shd-max-latency 99999
> cluster.halo-nfsd-max-latency 5
> cluster.halo-max-latency 5
> cluster.halo-max-replicas
>>
>> Thanks,
>> Soumya
>>
>>>
>>> Does this process make more sense than a version upgrade path to
>>> 4.1, then 5, then 6? What "gotchas" do I need to be ready for? I
>>> have until late May to prep and test on old, slow hardware with a
>>> small number of files and volumes.
>>>
>>>
>>> You can upgrade directly from 3.12 to 6.x. I would suggest that rather
>>> than deleting and recreating the Gluster volumes. +Hari and +Sanju for
>>> further guidelines on the upgrade, as they recently ran upgrade tests.
>>> +Soumya to add to the nfs-ganesha aspect.
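>>> Roughly, the rolling upgrade is done one server at a time, something like
>>> the sketch below (the package and repo names assume CentOS 7 with the
>>> Storage SIG repos; please follow the official upgrade guide and check
>>> heals between nodes):
>>>
>>>     # on each server, one at a time
>>>     systemctl stop glusterd
>>>     pkill -x glusterfsd; pkill -x glusterfs   # stop remaining gluster processes
>>>     yum install centos-release-gluster6 && yum update glusterfs-server
>>>     systemctl start glusterd
>>>     gluster volume heal <volname> info        # wait until heals complete
>>>     # after all servers (and clients) are upgraded
>>>     gluster volume set all cluster.op-version 60000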
>>>
>>> Regards,
>>> Poornima
>>>
>>> --
>>>
>>> James P. Kinney III
>>>
>>> Every time you stop a school, you will have to build a jail. What you
>>> gain at one end you lose at the other. It's like feeding a dog on his
>>> own tail. It won't fatten the dog.
>>> - Speech 11/23/1900 Mark Twain
>>>
>>>
>>> http://heretothereideas.blogspot.com/
>>>
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
> --
>
> James P. Kinney III
>
> Every time you stop a school, you will have to build a jail. What you
> gain at one end you lose at the other. It's like feeding a dog on his
> own tail. It won't fatten the dog.
> - Speech 11/23/1900 Mark Twain
>
> http://heretothereideas.blogspot.com/
>