[Gluster-users] upgrade best practices

Soumya Koduri skoduri at redhat.com
Mon Apr 1 18:37:18 UTC 2019


Thanks for the details. Response inline -

On 4/1/19 9:45 PM, Jim Kinney wrote:
> On Sun, 2019-03-31 at 23:01 +0530, Soumya Koduri wrote:
>>
>> On 3/29/19 10:39 PM, Poornima Gurusiddaiah wrote:
>>>
>>>
>>> On Fri, Mar 29, 2019, 10:03 PM Jim Kinney <jim.kinney at gmail.com> wrote:
>>>
>>>      Currently running 3.12 on CentOS 7.6. Doing cleanups on split-brain
>>>      and out-of-sync files that need healing.
>>>
>>>      We need to migrate the three replica servers to gluster v. 5 or 6.
>>>      We will also need to upgrade about 80 clients. Given that a
>>>      complete removal of gluster will not touch the 200+TB of data on 12
>>>      volumes, we are looking at doing that process: stop all clients,
>>>      stop all glusterd services, remove all of it, install the new
>>>      version, set up new volumes from the old bricks, install new
>>>      clients, and mount everything.
>>>
>>>      We would like to get some better performance from nfs-ganesha mounts,
>>>      but that doesn't look like an option (we haven't done any parameter
>>>      tweaks in testing yet). At a bare minimum, we would like to minimize
>>>      the total downtime of all systems.
>>
>> Could you please be more specific here? As in, are you looking for better
>> performance during the upgrade process or in general? Compared to 3.12,
>> there are a lot of perf improvements in both the glusterfs and especially
>> the nfs-ganesha (latest stable - V2.7.x) stacks. If you could provide more
>> information about your workloads (for eg., large-file, small-file,
>> metadata-intensive), we can make some recommendations wrt configuration.
> 
> Sure. More details:
> 
> We are (soon to be) running a three-node, replica-only gluster service (2 
> nodes now; the third is racked and ready for sync and being added to the 
> gluster cluster). Each node has 2 external drive arrays plus one internal. 
> Each node has 40G IB plus 40G IP connections (plans to upgrade to 100G). We 
> currently have 9 volumes, each ranging from 7TB up to 50TB of space. Each 
> volume is a mix of thousands of large (>1GB) files and tens of thousands of 
> small (~100KB) ones, plus thousands in between.
> 
> Currently we have a 13-node computational cluster with varying GPU 
> abilities that mounts all of these volumes using gluster-fuse. Writes 
> are slow, and reads perform as if from a single server. I have data from 
> a test setup (not anywhere near the capacity of the production system - 
> just for testing commands and recoveries) that indicates raw NFS without 
> gluster is much faster, while gluster-fuse is much slower. We have mmap 
> issues with python on fuse-mounted locations; converting to NFS solves 
> this. We have tinkered with kernel settings to handle the oom-killer so 
> it will no longer drop glusterfs when an errant job eats all the RAM (set 
> oom_score_adj to -1000 for all glusterfs pids).
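> 
> Roughly what that looks like (a sketch only; run as root on each node
> once the gluster processes are up):
> 
>     # exempt every gluster process (glusterd/glusterfs/glusterfsd)
>     # from the OOM killer
>     for pid in $(pgrep gluster); do
>         echo -1000 > /proc/$pid/oom_score_adj
>     done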

Have you tried tuning any perf parameters? From the volume options you 
have shared below, I see that there is scope to improve performance (for 
eg., by enabling md-cache parameters and parallel-readdir, the latency of 
metadata-related operations can be improved). Request Poornima, Xavi or 
Du to comment on recommended values for better I/O throughput for your 
workload.
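
For example (illustrative starting values only - please validate them on 
your test setup before applying to production):

    # md-cache plus upcall-based cache invalidation
    gluster volume set home features.cache-invalidation on
    gluster volume set home features.cache-invalidation-timeout 600
    gluster volume set home performance.cache-invalidation on
    gluster volume set home performance.md-cache-timeout 600
    gluster volume set home network.inode-lru-limit 200000

    # faster directory listings (parallel-readdir needs readdir-ahead)
    gluster volume set home performance.readdir-ahead on
    gluster volume set home performance.parallel-readdir on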


> 
> We would like to transition (smoothly!!) to gluster 5 or 6 with 
> nfs-ganesha 2.7 and see some performance improvements. We will be using 
> corosync and pacemaker for NFS failover. It would be fantastic to be able 
> to saturate a 10G IPoIB (or 40G IB!) connection to each compute node in 
> the current computational cluster. Right now we absolutely can't get much 
> write speed (copying a 6.2GB file from host to gluster storage took 
> 1m 21s; cp from disk to /dev/null is 7s). cp from gluster to /dev/null 
> is 1.0m (same 6.2GB file). That's a 10Gbps IPoIB connection at only about 
> 800Mbps.
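> 
> (As a rough check: 6.2 GB * 8 = 49.6 Gb; 49.6 Gb / 60 s is about 0.83 Gbps 
> (~830 Mbps) for the read, and 49.6 Gb / 81 s is about 0.61 Gbps for the 
> write, over a nominal 10 Gbps link.)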

A few things to note here -

* The volume option "nfs.disable" refers to the GlusterNFS service, which 
is deprecated and not enabled by default in the latest gluster versions 
(like gluster 5 & 6). We recommend NFS-Ganesha, and hence this option 
needs to be turned on (to disable GlusterNFS).
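
For example:

    # turn off the legacy GlusterNFS server on each volume
    gluster volume set home nfs.disable on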

* Starting from Gluster 3.11, the HA configuration bits for NFS-Ganesha 
have been removed from the gluster codebase. So you would need to either 
manually configure an HA service on top of the NFS-Ganesha servers or use 
storhaug [1] to configure the same.
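
If you go the corosync/pacemaker route yourself, the core of it is a 
floating VIP colocated with a running nfs-ganesha instance. A minimal 
sketch using pcs (resource names and the IP are placeholders to adjust):

    # virtual IP the NFS clients mount; fails over between ganesha nodes
    pcs resource create nfs_vip ocf:heartbeat:IPaddr2 ip=192.0.2.10 \
        cidr_netmask=24 op monitor interval=10s

    # run nfs-ganesha on all nodes; keep the VIP with a healthy instance
    pcs resource create ganesha systemd:nfs-ganesha clone
    pcs constraint colocation add nfs_vip with ganesha-clone INFINITY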

* Coming to technical aspects, by switching to 'NFS' you could benefit 
from the heavy caching done by the NFS client and a few other 
optimizations it does. The NFS-Ganesha server also does metadata caching 
and resides on the same nodes as the glusterfs servers. Apart from these, 
NFS-Ganesha acts like any other glusterfs client (but using libgfapi 
rather than a fuse mount). It would be interesting to check if and how 
much improvement you get with 'NFS' compared to the fuse protocol for 
your workload. Please let us know when you have the test environment 
ready. We will make recommendations wrt a few settings for the 
NFS-Ganesha server and client.
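
As a starting point, a minimal ganesha.conf export block for one volume 
would look roughly like the following (illustrative only - Export_Id, 
paths and access settings are placeholders to adjust):

    EXPORT {
        Export_Id = 1;
        Path = "/home";
        Pseudo = "/home";
        Access_Type = RW;
        Squash = No_root_squash;
        FSAL {
            Name = GLUSTER;
            # NFS-Ganesha reaches the volume via libgfapi, not a fuse mount
            Hostname = "localhost";
            Volume = "home";
        }
    }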


Thanks,
Soumya

[1] https://github.com/linux-ha-storage/storhaug
> 
> We would like to do things like enable SSL encryption of all data flows 
> (we deal with PHI data in a HIPAA-regulated setting) but are concerned 
> about performance. We are running dual Intel Xeon E5-2630L (12 physical 
> cores each @ 2.4GHz) and 128GB RAM in each server node. We have 170 
> users. About 20 are active at any time.
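> 
> The gluster-side knobs for that would roughly be (a sketch; certificate, 
> key and CA files must already be in place on every server and client):
> 
>     # management-path TLS: create on all nodes, then restart glusterd
>     touch /var/lib/glusterd/secure-access
> 
>     # I/O-path TLS, per volume
>     gluster volume set home client.ssl on
>     gluster volume set home server.ssl on
>     # '*' or a comma-separated list of allowed certificate CNs
>     gluster volume set home auth.ssl-allow '*'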
> 
> The current settings on /home (others are similar if not identical; maybe 
> nfs.disable is true for others):
> 
> gluster volume get home all
> Option                                  Value
> ------                                  -----
> cluster.lookup-unhashed                 on
> cluster.lookup-optimize                 off
> cluster.min-free-disk                   10%
> cluster.min-free-inodes                 5%
> cluster.rebalance-stats                 off
> cluster.subvols-per-directory           (null)
> cluster.readdir-optimize                off
> cluster.rsync-hash-regex                (null)
> cluster.extra-hash-regex                (null)
> cluster.dht-xattr-name                  trusted.glusterfs.dht
> cluster.randomize-hash-range-by-gfid    off
> cluster.rebal-throttle                  normal
> cluster.lock-migration                  off
> cluster.local-volume-name               (null)
> cluster.weighted-rebalance              on
> cluster.switch-pattern                  (null)
> cluster.entry-change-log                on
> cluster.read-subvolume                  (null)
> cluster.read-subvolume-index            -1
> cluster.read-hash-mode                  1
> cluster.background-self-heal-count      8
> cluster.metadata-self-heal              on
> cluster.data-self-heal                  on
> cluster.entry-self-heal                 on
> cluster.self-heal-daemon                enable
> cluster.heal-timeout                    600
> cluster.self-heal-window-size           1
> cluster.data-change-log                 on
> cluster.metadata-change-log             on
> cluster.data-self-heal-algorithm        (null)
> cluster.eager-lock                      on
> disperse.eager-lock                     on
> cluster.quorum-type                     none
> cluster.quorum-count                    (null)
> cluster.choose-local                    true
> cluster.self-heal-readdir-size          1KB
> cluster.post-op-delay-secs              1
> cluster.ensure-durability               on
> cluster.consistent-metadata             no
> cluster.heal-wait-queue-length          128
> cluster.favorite-child-policy           none
> cluster.stripe-block-size               128KB
> cluster.stripe-coalesce                 true
> diagnostics.latency-measurement         off
> diagnostics.dump-fd-stats               off
> diagnostics.count-fop-hits              off
> diagnostics.brick-log-level             INFO
> diagnostics.client-log-level            INFO
> diagnostics.brick-sys-log-level         CRITICAL
> diagnostics.client-sys-log-level        CRITICAL
> diagnostics.brick-logger                (null)
> diagnostics.client-logger               (null)
> diagnostics.brick-log-format            (null)
> diagnostics.client-log-format           (null)
> diagnostics.brick-log-buf-size          5
> diagnostics.client-log-buf-size         5
> diagnostics.brick-log-flush-timeout     120
> diagnostics.client-log-flush-timeout    120
> diagnostics.stats-dump-interval         0
> diagnostics.fop-sample-interval         0
> diagnostics.stats-dump-format           json
> diagnostics.fop-sample-buf-size         65535
> diagnostics.stats-dnscache-ttl-sec      86400
> performance.cache-max-file-size         0
> performance.cache-min-file-size         0
> performance.cache-refresh-timeout       1
> performance.cache-priority
> performance.cache-size                  32MB
> performance.io-thread-count             16
> performance.high-prio-threads           16
> performance.normal-prio-threads         16
> performance.low-prio-threads            16
> performance.least-prio-threads          1
> performance.enable-least-priority       on
> performance.cache-size                  128MB
> performance.flush-behind                on
> performance.nfs.flush-behind            on
> performance.write-behind-window-size    1MB
> performance.resync-failed-syncs-after-fsync off
> performance.nfs.write-behind-window-size 1MB
> performance.strict-o-direct             off
> performance.nfs.strict-o-direct         off
> performance.strict-write-ordering       off
> performance.nfs.strict-write-ordering   off
> performance.lazy-open                   yes
> performance.read-after-open             no
> performance.read-ahead-page-count       4
> performance.md-cache-timeout            1
> performance.cache-swift-metadata        true
> performance.cache-samba-metadata        false
> performance.cache-capability-xattrs     true
> performance.cache-ima-xattrs            true
> features.encryption                     off
> encryption.master-key                   (null)
> encryption.data-key-size                256
> encryption.block-size                   4096
> network.frame-timeout                   1800
> network.ping-timeout                    42
> network.tcp-window-size                 (null)
> features.lock-heal                      off
> features.grace-timeout                  10
> network.remote-dio                      disable
> client.event-threads                    2
> client.tcp-user-timeout                 0
> client.keepalive-time                   20
> client.keepalive-interval               2
> client.keepalive-count                  9
> network.tcp-window-size                 (null)
> network.inode-lru-limit                 16384
> auth.allow                              *
> auth.reject                             (null)
> transport.keepalive                     1
> server.allow-insecure                   (null)
> server.root-squash                      off
> server.anonuid                          65534
> server.anongid                          65534
> server.statedump-path                   /var/run/gluster
> server.outstanding-rpc-limit            64
> features.lock-heal                      off
> features.grace-timeout                  10
> server.ssl                              (null)
> auth.ssl-allow                          *
> server.manage-gids                      off
> server.dynamic-auth                     on
> client.send-gids                        on
> server.gid-timeout                      300
> server.own-thread                       (null)
> server.event-threads                    1
> server.tcp-user-timeout                 0
> server.keepalive-time                   20
> server.keepalive-interval               2
> server.keepalive-count                  9
> transport.listen-backlog                10
> ssl.own-cert                            (null)
> ssl.private-key                         (null)
> ssl.ca-list                             (null)
> ssl.crl-path                            (null)
> ssl.certificate-depth                   (null)
> ssl.cipher-list                         (null)
> ssl.dh-param                            (null)
> ssl.ec-curve                            (null)
> performance.write-behind                on
> performance.read-ahead                  on
> performance.readdir-ahead               off
> performance.io-cache                    on
> performance.quick-read                  on
> performance.open-behind                 on
> performance.nl-cache                    off
> performance.stat-prefetch               on
> performance.client-io-threads           off
> performance.nfs.write-behind            on
> performance.nfs.read-ahead              off
> performance.nfs.io-cache                off
> performance.nfs.quick-read              off
> performance.nfs.stat-prefetch           off
> performance.nfs.io-threads              off
> performance.force-readdirp              true
> performance.cache-invalidation          false
> features.uss                            off
> features.snapshot-directory             .snaps
> features.show-snapshot-directory        off
> network.compression                     off
> network.compression.window-size         -15
> network.compression.mem-level           8
> network.compression.min-size            0
> network.compression.compression-level   -1
> network.compression.debug               false
> features.limit-usage                    (null)
> features.default-soft-limit             80%
> features.soft-timeout                   60
> features.hard-timeout                   5
> features.alert-time                     86400
> features.quota-deem-statfs              off
> geo-replication.indexing                off
> geo-replication.indexing                off
> geo-replication.ignore-pid-check        off
> geo-replication.ignore-pid-check        off
> features.quota                          off
> features.inode-quota                    off
> features.bitrot                         disable
> debug.trace                             off
> debug.log-history                       no
> debug.log-file                          no
> debug.exclude-ops                       (null)
> debug.include-ops                       (null)
> debug.error-gen                         off
> debug.error-failure                     (null)
> debug.error-number                      (null)
> debug.random-failure                    off
> debug.error-fops                        (null)
> nfs.enable-ino32                        no
> nfs.mem-factor                          15
> nfs.export-dirs                         on
> nfs.export-volumes                      on
> nfs.addr-namelookup                     off
> nfs.dynamic-volumes                     off
> nfs.register-with-portmap               on
> nfs.outstanding-rpc-limit               16
> nfs.port                                2049
> nfs.rpc-auth-unix                       on
> nfs.rpc-auth-null                       on
> nfs.rpc-auth-allow                      all
> nfs.rpc-auth-reject                     none
> nfs.ports-insecure                      off
> nfs.trusted-sync                        off
> nfs.trusted-write                       off
> nfs.volume-access                       read-write
> nfs.export-dir
> nfs.disable                             off
> nfs.nlm                                 on
> nfs.acl                                 on
> nfs.mount-udp                           off
> nfs.mount-rmtab                         /var/lib/glusterd/nfs/rmtab
> nfs.rpc-statd                           /sbin/rpc.statd
> nfs.server-aux-gids                     off
> nfs.drc                                 off
> nfs.drc-size                            0x20000
> nfs.read-size                           (1 * 1048576ULL)
> nfs.write-size                          (1 * 1048576ULL)
> nfs.readdir-size                        (1 * 1048576ULL)
> nfs.rdirplus                            on
> nfs.exports-auth-enable                 (null)
> nfs.auth-refresh-interval-sec           (null)
> nfs.auth-cache-ttl-sec                  (null)
> features.read-only                      off
> features.worm                           off
> features.worm-file-level                off
> features.default-retention-period       120
> features.retention-mode                 relax
> features.auto-commit-period             180
> storage.linux-aio                       off
> storage.batch-fsync-mode                reverse-fsync
> storage.batch-fsync-delay-usec          0
> storage.owner-uid                       -1
> storage.owner-gid                       -1
> storage.node-uuid-pathinfo              off
> storage.health-check-interval           30
> storage.build-pgfid                     on
> storage.gfid2path                       on
> storage.gfid2path-separator             :
> storage.bd-aio                          off
> cluster.server-quorum-type              off
> cluster.server-quorum-ratio             0
> changelog.changelog                     off
> changelog.changelog-dir                 (null)
> changelog.encoding                      ascii
> changelog.rollover-time                 15
> changelog.fsync-interval                5
> changelog.changelog-barrier-timeout     120
> changelog.capture-del-path              off
> features.barrier                        disable
> features.barrier-timeout                120
> features.trash                          off
> features.trash-dir                      .trashcan
> features.trash-eliminate-path           (null)
> features.trash-max-filesize             5MB
> features.trash-internal-op              off
> cluster.enable-shared-storage           disable
> cluster.write-freq-threshold            0
> cluster.read-freq-threshold             0
> cluster.tier-pause                      off
> cluster.tier-promote-frequency          120
> cluster.tier-demote-frequency           3600
> cluster.watermark-hi                    90
> cluster.watermark-low                   75
> cluster.tier-mode                       cache
> cluster.tier-max-promote-file-size      0
> cluster.tier-max-mb                     4000
> cluster.tier-max-files                  10000
> cluster.tier-query-limit                100
> cluster.tier-compact                    on
> cluster.tier-hot-compact-frequency      604800
> cluster.tier-cold-compact-frequency     604800
> features.ctr-enabled                    off
> features.record-counters                off
> features.ctr-record-metadata-heat       off
> features.ctr_link_consistency           off
> features.ctr_lookupheal_link_timeout    300
> features.ctr_lookupheal_inode_timeout   300
> features.ctr-sql-db-cachesize           12500
> features.ctr-sql-db-wal-autocheckpoint  25000
> features.selinux                        on
> locks.trace                             off
> locks.mandatory-locking                 off
> cluster.disperse-self-heal-daemon       enable
> cluster.quorum-reads                    no
> client.bind-insecure                    (null)
> features.shard                          off
> features.shard-block-size               64MB
> features.scrub-throttle                 lazy
> features.scrub-freq                     biweekly
> features.scrub                          false
> features.expiry-time                    120
> features.cache-invalidation             off
> features.cache-invalidation-timeout     60
> features.leases                         off
> features.lease-lock-recall-timeout      60
> disperse.background-heals               8
> disperse.heal-wait-qlength              128
> cluster.heal-timeout                    600
> dht.force-readdirp                      on
> disperse.read-policy                    gfid-hash
> cluster.shd-max-threads                 1
> cluster.shd-wait-qlength                1024
> cluster.locking-scheme                  full
> cluster.granular-entry-heal             no
> features.locks-revocation-secs          0
> features.locks-revocation-clear-all     false
> features.locks-revocation-max-blocked   0
> features.locks-monkey-unlocking         false
> disperse.shd-max-threads                1
> disperse.shd-wait-qlength               1024
> disperse.cpu-extensions                 auto
> disperse.self-heal-window-size          1
> cluster.use-compound-fops               off
> performance.parallel-readdir            off
> performance.rda-request-size            131072
> performance.rda-low-wmark               4096
> performance.rda-high-wmark              128KB
> performance.rda-cache-limit             10MB
> performance.nl-cache-positive-entry     false
> performance.nl-cache-limit              10MB
> performance.nl-cache-timeout            60
> cluster.brick-multiplex                 off
> cluster.max-bricks-per-process          0
> disperse.optimistic-change-log          on
> cluster.halo-enabled                    False
> cluster.halo-shd-max-latency            99999
> cluster.halo-nfsd-max-latency           5
> cluster.halo-max-latency                5
> cluster.halo-max-replicas
>>
>> Thanks,
>> Soumya
>>
>>>
>>>      Does this process make more sense than a version upgrade process to
>>>      4.1, then 5, then 6? What "gotcha's" do I need to be ready for? I
>>>      have until late May to prep and test on old, slow hardware with a
>>>      small amount of files and volumes.
>>>
>>>
>>> You can directly upgrade from 3.12 to 6.x. I would suggest that rather
>>> than deleting and recreating the Gluster volumes. +Hari and +Sanju for
>>> further guidelines on upgrade, as they recently did upgrade tests.
>>> +Soumya to add to the nfs-ganesha aspect.
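>>> 
>>> Roughly, one server at a time (a sketch assuming the CentOS Storage SIG
>>> packages; <volname> is a placeholder, and heals should be allowed to
>>> finish before moving to the next node):
>>> 
>>>     systemctl stop glusterd
>>>     pkill glusterfs; pkill glusterfsd
>>>     yum install centos-release-gluster6
>>>     yum update 'glusterfs*'
>>>     systemctl start glusterd
>>>     gluster volume heal <volname> info    # wait for zero pending entries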
>>>
>>> Regards,
>>> Poornima
>>>
>>>      --
>>>
>>>      James P. Kinney III
>>>
>>>      Every time you stop a school, you will have to build a jail. What you
>>>      gain at one end you lose at the other. It's like feeding a dog on his
>>>      own tail. It won't fatten the dog.
>>>      - Speech 11/23/1900 Mark Twain
>>>
>>>      
>>> http://heretothereideas.blogspot.com/
>>>
>>>
>>>      _______________________________________________
>>>      Gluster-users mailing list
>>>      Gluster-users at gluster.org
>>>      https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
> -- 
> 
> James P. Kinney III
> 
> Every time you stop a school, you will have to build a jail. What you
> gain at one end you lose at the other. It's like feeding a dog on his
> own tail. It won't fatten the dog.
> - Speech 11/23/1900 Mark Twain
> 
> http://heretothereideas.blogspot.com/
> 

