[Gluster-users] upgrade best practices
Jim Kinney
jim.kinney at gmail.com
Mon Apr 1 16:15:10 UTC 2019
On Sun, 2019-03-31 at 23:01 +0530, Soumya Koduri wrote:
> On 3/29/19 10:39 PM, Poornima Gurusiddaiah wrote:
> > On Fri, Mar 29, 2019, 10:03 PM Jim Kinney <jim.kinney at gmail.com> wrote:
> > Currently running 3.12 on CentOS 7.6. Doing cleanups on split-
> > brain and out-of-sync files that need healing.
> > We need to migrate the three replica servers to gluster v. 5 or
> > 6, and will also need to upgrade about 80 clients. Given that a
> > complete removal of gluster will not touch the 200+TB of data on
> > 12 volumes, we are looking at doing that process: stop all
> > clients, stop all glusterd services, remove all of it, install
> > the new version, set up new volumes from the old bricks, install
> > new clients, and mount everything.
> > We would like to get some better performance from nfs-ganesha
> > mounts, but that doesn't look like an option (we have not done
> > any parameter tweaks in testing yet). At a bare minimum, we
> > would like to minimize the total downtime of all systems.
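For concreteness, the wipe-and-reuse-bricks path described above would
look roughly like this per volume - an untested sketch, with the volume
name, brick paths, and package names as placeholders:

  # on every client: umount the volume; then on every server:
  systemctl stop glusterd
  pkill glusterfs; pkill glusterfsd
  yum remove glusterfs-server                 # old 3.12 packages
  yum install glusterfs-server                # new 5.x/6.x packages
  systemctl start glusterd
  gluster peer probe server2                  # re-form the trusted pool
  gluster peer probe server3
  # reuse the old bricks: data stays in place, but each brick root needs
  # its old volume-id xattr cleared before it can join a new volume
  setfattr -x trusted.glusterfs.volume-id /bricks/home/brick
  gluster volume create home replica 3 server1:/bricks/home/brick \
      server2:/bricks/home/brick server3:/bricks/home/brick force
  gluster volume start home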
>
> Could you please be more specific here? As in, are you looking for
> better performance during the upgrade process or in general? Compared
> to 3.12, there are a lot of perf improvements in both the glusterfs
> and especially the nfs-ganesha (latest stable - V2.7.x) stack. If you
> could provide more information about your workloads (for eg.,
> large-file, small-file, metadata-intensive), we can make some
> recommendations with regard to configuration.
Sure. More details:
We are (soon to be) running a three-node replica-only gluster service
(2 nodes now; the third is racked and ready for sync and being added to
the gluster cluster). Each node has 2 external drive arrays plus one
internal. Each node has 40G IB plus 40G IP connections (with plans to
upgrade to 100G). We currently have 9 volumes, ranging from 7TB up to
50TB of space each. Each volume is a mix of thousands of large (>1GB)
files and tens of thousands of small (~100KB) files, plus thousands in
between.
Currently we have a 13-node computational cluster with varying GPU
abilities that mounts all of these volumes using gluster-fuse. Writes
are slow, and reads also perform as if coming from a single server. I
have data from a test setup (not anywhere near the capacity of the
production system - just for testing commands and recoveries) that
indicates raw NFS (no gluster) is much faster, while gluster-fuse is
much slower. We have mmap issues with python and fuse-mounted
locations; converting to NFS solves this. We have tinkered with kernel
settings to handle the oom-killer so it will no longer kill glusterfs
when an errant job eats all the RAM (we set oom_score_adj to -1000 for
all glusterfs pids).
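For anyone fighting the same oom-killer issue, the tweak amounts to
something like this on each node (a minimal sketch; the pgrep pattern
and how you re-apply it after process restarts are up to you):

  # shield all gluster processes from the oom-killer
  for pid in $(pgrep -f 'glusterd|glusterfsd|glusterfs'); do
      echo -1000 > /proc/$pid/oom_score_adj
  done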
We would like to transition (smoothly!!) to gluster 5 or 6 with nfs-
ganesha 2.7 and see some performance improvements. We will be using
corosync and pacemaker for NFS failover. It would be fantastic to be
able to saturate a 10G IPoIB (or 40G IB!) connection to each compute
node in the current computational cluster. Right now we absolutely
can't get much write speed (copying a 6.2GB file from a host to gluster
storage took 1m 21s; cp from disk to /dev/null is 7s). cp from gluster
to /dev/null is 1.0m (same 6.2GB file). That's a 10Gbps IPoIB
connection running at only ~800Mbps.
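For reference, those numbers work out to roughly:

  6.2 GB / 81 s  ~  77 MB/s  ~ 0.6 Gbps   (write to the gluster mount)
  6.2 GB / 60 s  ~ 103 MB/s  ~ 0.8 Gbps   (read from the gluster mount)
  6.2 GB /  7 s  ~ 886 MB/s  ~ 7.1 Gbps   (local disk to /dev/null)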
We would like to do things like enable SSL encryption of all data flows
(we deal with PHI data in a HIPAA-regulated setting) but are concerned
about performance. We are running dual Intel Xeon E5-2630L (12
physical cores each @ 2.4GHz) and 128GB RAM in each server node. We
have 170 users. About 20 are active at any time.
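For what it's worth, the TLS setup we would be testing is the standard
gluster one - a sketch only, assuming the default certificate locations
and that every node's certificate CN gets listed in auth.ssl-allow:

  # on every server and client: key, cert and CA bundle in the default
  # paths gluster checks: /etc/ssl/glusterfs.key, /etc/ssl/glusterfs.pem,
  # /etc/ssl/glusterfs.ca
  touch /var/lib/glusterd/secure-access       # TLS on the management path
  gluster volume set home client.ssl on       # TLS on the I/O path
  gluster volume set home server.ssl on
  gluster volume set home auth.ssl-allow 'server1-cn,server2-cn,server3-cn'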
The current settings on /home (other volumes are similar if not
identical; maybe nfs.disable is true for the others):
gluster volume get home all

Option  Value
------  -----
cluster.lookup-unhashed  on
cluster.lookup-optimize  off
cluster.min-free-disk  10%
cluster.min-free-inodes  5%
cluster.rebalance-stats  off
cluster.subvols-per-directory  (null)
cluster.readdir-optimize  off
cluster.rsync-hash-regex  (null)
cluster.extra-hash-regex  (null)
cluster.dht-xattr-name  trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid  off
cluster.rebal-throttle  normal
cluster.lock-migration  off
cluster.local-volume-name  (null)
cluster.weighted-rebalance  on
cluster.switch-pattern  (null)
cluster.entry-change-log  on
cluster.read-subvolume  (null)
cluster.read-subvolume-index  -1
cluster.read-hash-mode  1
cluster.background-self-heal-count  8
cluster.metadata-self-heal  on
cluster.data-self-heal  on
cluster.entry-self-heal  on
cluster.self-heal-daemon  enable
cluster.heal-timeout  600
cluster.self-heal-window-size  1
cluster.data-change-log  on
cluster.metadata-change-log  on
cluster.data-self-heal-algorithm  (null)
cluster.eager-lock  on
disperse.eager-lock  on
cluster.quorum-type  none
cluster.quorum-count  (null)
cluster.choose-local  true
cluster.self-heal-readdir-size  1KB
cluster.post-op-delay-secs  1
cluster.ensure-durability  on
cluster.consistent-metadata  no
cluster.heal-wait-queue-length  128
cluster.favorite-child-policy  none
cluster.stripe-block-size  128KB
cluster.stripe-coalesce  true
diagnostics.latency-measurement  off
diagnostics.dump-fd-stats  off
diagnostics.count-fop-hits  off
diagnostics.brick-log-level  INFO
diagnostics.client-log-level  INFO
diagnostics.brick-sys-log-level  CRITICAL
diagnostics.client-sys-log-level  CRITICAL
diagnostics.brick-logger  (null)
diagnostics.client-logger  (null)
diagnostics.brick-log-format  (null)
diagnostics.client-log-format  (null)
diagnostics.brick-log-buf-size  5
diagnostics.client-log-buf-size  5
diagnostics.brick-log-flush-timeout  120
diagnostics.client-log-flush-timeout  120
diagnostics.stats-dump-interval  0
diagnostics.fop-sample-interval  0
diagnostics.stats-dump-format  json
diagnostics.fop-sample-buf-size  65535
diagnostics.stats-dnscache-ttl-sec  86400
performance.cache-max-file-size  0
performance.cache-min-file-size  0
performance.cache-refresh-timeout  1
performance.cache-priority
performance.cache-size  32MB
performance.io-thread-count  16
performance.high-prio-threads  16
performance.normal-prio-threads  16
performance.low-prio-threads  16
performance.least-prio-threads  1
performance.enable-least-priority  on
performance.cache-size  128MB
performance.flush-behind  on
performance.nfs.flush-behind  on
performance.write-behind-window-size  1MB
performance.resync-failed-syncs-after-fsync  off
performance.nfs.write-behind-window-size  1MB
performance.strict-o-direct  off
performance.nfs.strict-o-direct  off
performance.strict-write-ordering  off
performance.nfs.strict-write-ordering  off
performance.lazy-open  yes
performance.read-after-open  no
performance.read-ahead-page-count  4
performance.md-cache-timeout  1
performance.cache-swift-metadata  true
performance.cache-samba-metadata  false
performance.cache-capability-xattrs  true
performance.cache-ima-xattrs  true
features.encryption  off
encryption.master-key  (null)
encryption.data-key-size  256
encryption.block-size  4096
network.frame-timeout  1800
network.ping-timeout  42
network.tcp-window-size  (null)
features.lock-heal  off
features.grace-timeout  10
network.remote-dio  disable
client.event-threads  2
client.tcp-user-timeout  0
client.keepalive-time  20
client.keepalive-interval  2
client.keepalive-count  9
network.tcp-window-size  (null)
network.inode-lru-limit  16384
auth.allow  *
auth.reject  (null)
transport.keepalive  1
server.allow-insecure  (null)
server.root-squash  off
server.anonuid  65534
server.anongid  65534
server.statedump-path  /var/run/gluster
server.outstanding-rpc-limit  64
features.lock-heal  off
features.grace-timeout  10
server.ssl  (null)
auth.ssl-allow  *
server.manage-gids  off
server.dynamic-auth  on
client.send-gids  on
server.gid-timeout  300
server.own-thread  (null)
server.event-threads  1
server.tcp-user-timeout  0
server.keepalive-time  20
server.keepalive-interval  2
server.keepalive-count  9
transport.listen-backlog  10
ssl.own-cert  (null)
ssl.private-key  (null)
ssl.ca-list  (null)
ssl.crl-path  (null)
ssl.certificate-depth  (null)
ssl.cipher-list  (null)
ssl.dh-param  (null)
ssl.ec-curve  (null)
performance.write-behind  on
performance.read-ahead  on
performance.readdir-ahead  off
performance.io-cache  on
performance.quick-read  on
performance.open-behind  on
performance.nl-cache  off
performance.stat-prefetch  on
performance.client-io-threads  off
performance.nfs.write-behind  on
performance.nfs.read-ahead  off
performance.nfs.io-cache  off
performance.nfs.quick-read  off
performance.nfs.stat-prefetch  off
performance.nfs.io-threads  off
performance.force-readdirp  true
performance.cache-invalidation  false
features.uss  off
features.snapshot-directory  .snaps
features.show-snapshot-directory  off
network.compression  off
network.compression.window-size  -15
network.compression.mem-level  8
network.compression.min-size  0
network.compression.compression-level  -1
network.compression.debug  false
features.limit-usage  (null)
features.default-soft-limit  80%
features.soft-timeout  60
features.hard-timeout  5
features.alert-time  86400
features.quota-deem-statfs  off
geo-replication.indexing  off
geo-replication.indexing  off
geo-replication.ignore-pid-check  off
geo-replication.ignore-pid-check  off
features.quota  off
features.inode-quota  off
features.bitrot  disable
debug.trace  off
debug.log-history  no
debug.log-file  no
debug.exclude-ops  (null)
debug.include-ops  (null)
debug.error-gen  off
debug.error-failure  (null)
debug.error-number  (null)
debug.random-failure  off
debug.error-fops  (null)
nfs.enable-ino32  no
nfs.mem-factor  15
nfs.export-dirs  on
nfs.export-volumes  on
nfs.addr-namelookup  off
nfs.dynamic-volumes  off
nfs.register-with-portmap  on
nfs.outstanding-rpc-limit  16
nfs.port  2049
nfs.rpc-auth-unix  on
nfs.rpc-auth-null  on
nfs.rpc-auth-allow  all
nfs.rpc-auth-reject  none
nfs.ports-insecure  off
nfs.trusted-sync  off
nfs.trusted-write  off
nfs.volume-access  read-write
nfs.export-dir
nfs.disable  off
nfs.nlm  on
nfs.acl  on
nfs.mount-udp  off
nfs.mount-rmtab  /var/lib/glusterd/nfs/rmtab
nfs.rpc-statd  /sbin/rpc.statd
nfs.server-aux-gids  off
nfs.drc  off
nfs.drc-size  0x20000
nfs.read-size  (1 * 1048576ULL)
nfs.write-size  (1 * 1048576ULL)
nfs.readdir-size  (1 * 1048576ULL)
nfs.rdirplus  on
nfs.exports-auth-enable  (null)
nfs.auth-refresh-interval-sec  (null)
nfs.auth-cache-ttl-sec  (null)
features.read-only  off
features.worm  off
features.worm-file-level  off
features.default-retention-period  120
features.retention-mode  relax
features.auto-commit-period  180
storage.linux-aio  off
storage.batch-fsync-mode  reverse-fsync
storage.batch-fsync-delay-usec  0
storage.owner-uid  -1
storage.owner-gid  -1
storage.node-uuid-pathinfo  off
storage.health-check-interval  30
storage.build-pgfid  on
storage.gfid2path  on
storage.gfid2path-separator  :
storage.bd-aio  off
cluster.server-quorum-type  off
cluster.server-quorum-ratio  0
changelog.changelog  off
changelog.changelog-dir  (null)
changelog.encoding  ascii
changelog.rollover-time  15
changelog.fsync-interval  5
changelog.changelog-barrier-timeout  120
changelog.capture-del-path  off
features.barrier  disable
features.barrier-timeout  120
features.trash  off
features.trash-dir  .trashcan
features.trash-eliminate-path  (null)
features.trash-max-filesize  5MB
features.trash-internal-op  off
cluster.enable-shared-storage  disable
cluster.write-freq-threshold  0
cluster.read-freq-threshold  0
cluster.tier-pause  off
cluster.tier-promote-frequency  120
cluster.tier-demote-frequency  3600
cluster.watermark-hi  90
cluster.watermark-low  75
cluster.tier-mode  cache
cluster.tier-max-promote-file-size  0
cluster.tier-max-mb  4000
cluster.tier-max-files  10000
cluster.tier-query-limit  100
cluster.tier-compact  on
cluster.tier-hot-compact-frequency  604800
cluster.tier-cold-compact-frequency  604800
features.ctr-enabled  off
features.record-counters  off
features.ctr-record-metadata-heat  off
features.ctr_link_consistency  off
features.ctr_lookupheal_link_timeout  300
features.ctr_lookupheal_inode_timeout  300
features.ctr-sql-db-cachesize  12500
features.ctr-sql-db-wal-autocheckpoint  25000
features.selinux  on
locks.trace  off
locks.mandatory-locking  off
cluster.disperse-self-heal-daemon  enable
cluster.quorum-reads  no
client.bind-insecure  (null)
features.shard  off
features.shard-block-size  64MB
features.scrub-throttle  lazy
features.scrub-freq  biweekly
features.scrub  false
features.expiry-time  120
features.cache-invalidation  off
features.cache-invalidation-timeout  60
features.leases  off
features.lease-lock-recall-timeout  60
disperse.background-heals  8
disperse.heal-wait-qlength  128
cluster.heal-timeout  600
dht.force-readdirp  on
disperse.read-policy  gfid-hash
cluster.shd-max-threads  1
cluster.shd-wait-qlength  1024
cluster.locking-scheme  full
cluster.granular-entry-heal  no
features.locks-revocation-secs  0
features.locks-revocation-clear-all  false
features.locks-revocation-max-blocked  0
features.locks-monkey-unlocking  false
disperse.shd-max-threads  1
disperse.shd-wait-qlength  1024
disperse.cpu-extensions  auto
disperse.self-heal-window-size  1
cluster.use-compound-fops  off
performance.parallel-readdir  off
performance.rda-request-size  131072
performance.rda-low-wmark  4096
performance.rda-high-wmark  128KB
performance.rda-cache-limit  10MB
performance.nl-cache-positive-entry  false
performance.nl-cache-limit  10MB
performance.nl-cache-timeout  60
cluster.brick-multiplex  off
cluster.max-bricks-per-process  0
disperse.optimistic-change-log  on
cluster.halo-enabled  False
cluster.halo-shd-max-latency  99999
cluster.halo-nfsd-max-latency  5
cluster.halo-max-latency  5
cluster.halo-max-replicas
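For what it's worth, the first knobs we plan to experiment with against
those defaults - a starting point to test, not settings we have
validated - are the md-cache/invalidation and event-thread options
already listed above, along these lines:

  gluster volume set home features.cache-invalidation on
  gluster volume set home features.cache-invalidation-timeout 600
  gluster volume set home performance.cache-invalidation on
  gluster volume set home performance.md-cache-timeout 600
  gluster volume set home network.inode-lru-limit 200000
  gluster volume set home performance.parallel-readdir on
  gluster volume set home client.event-threads 4
  gluster volume set home server.event-threads 4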
> Thanks, Soumya
> > Does this process make more sense than a version upgrade
> > process to 4.1, then 5, then 6? What "gotchas" do I need to be
> > ready for? I have until late May to prep and test on old, slow
> > hardware with a small number of files and volumes.
> >
> > You can directly upgrade from 3.12 to 6.x. I would suggest that
> > rather than deleting and recreating the Gluster volumes. +Hari and
> > +Sanju for further guidelines on the upgrade, as they recently did
> > upgrade tests. +Soumya to add to the nfs-ganesha aspect.
> > Regards, Poornima
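If we do end up going the in-place route instead, my understanding of
the per-node flow - a sketch only, with the repo package name assumed -
is roughly:

  # one replica server at a time:
  systemctl stop glusterd
  pkill glusterfs; pkill glusterfsd
  yum install centos-release-gluster6        # assumed SIG repo package
  yum update glusterfs-server
  systemctl start glusterd
  # wait for self-heal to finish on every volume before moving on
  gluster volume heal home info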
--
James P. Kinney III
Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain
http://heretothereideas.blogspot.com/