[Gluster-users] split-brain errors under heavy load when one brick down
Erik Jacobson
erik.jacobson at hpe.com
Mon Sep 16 14:04:25 UTC 2019
Hello all. I'm new to the list but not to gluster.
We are using gluster to service NFS boot on a top500 cluster. It is a
Distributed-Replicate volume, 3x9 (3 x 3 = 9 bricks, three-way
replication across nine servers).
We are having a problem: when one server in a subvolume goes down, we get
random missing files and split-brain errors in the nfs.log file.
We are using Gluster NFS. (We are interested in switching to Ganesha, but
this workload presents problems there that we still need to work through.)
Unfortunately, as with many such large systems, I am unable to take much
of the system out of production for debugging and cannot take the system
down to test this very often. However, my hope is to be well prepared when
the next large system comes through the factory so I can try to reproduce
this issue, or at least have some things to try.
In the lab, I have a test system that is also a 3x9 setup like the one at
the customer site, but with only 3 compute nodes instead of 2,592. We use
CTDB for IP alias management; the compute nodes connect to NFS through the
alias.
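For reference, a compute node's NFS root mount looks roughly like the
following; the alias address and target directory here are placeholders,
not real values from either system:

  # Illustrative NFSv3 mount of the gluster volume via a CTDB alias IP.
  # The address and mount point are placeholders.
  mount -t nfs -o ro,vers=3 172.23.255.1:/cm_shared /mnt/nfsroot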
Here is the issue we are having:
- 2,592 nodes all PXE-booting at once and using the Gluster servers for
their NFS root works great, including when one subvolume is degraded
due to the loss of a server. No issues at boot, no split-brain messages
in the log.
- The problem comes in when we do an intensive job launch. This launch
uses SLURM and then loads hundreds of shared libraries over NFS across
all 2,592 nodes.
- When all servers in the 3x9 pool are up, we're in good shape - no
issues on the compute nodes, no split-brain messages in the log.
- When one subvolume is missing one server (its Ethernet adapters died),
we still boot fine, but the SLURM launch hits random missing files.
Gluster nfs.log shows split-brain messages and ACCESS I/O errors.
- Taking an example failed file and accessing it across all compute nodes
always works afterwards; the issue is transient.
- The missing file is always found on the other bricks in the subvolume
when we search for it there as well.
- No filesystem or disk I/O errors in the logs or dmesg, and the files are
accessible before and after the transient error (including directly from
the bricks, as noted above).
- So when we are degraded, the customer's jobs fail to launch, with
library read errors, missing config files, and so on.
What is perplexing is that the huge load of 2,592 nodes PXE-booting with
NFS roots does not trigger the issue when one subvolume is degraded.
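When I next get a degraded window, this is the kind of data I plan to
gather first, to confirm whether AFR actually considers anything
split-brained or whether these are transient read failures. The brick
path below is from my lab setup and the file path is only a placeholder:

  # Does the self-heal machinery report any real split-brain entries?
  gluster volume heal cm_shared info split-brain
  gluster volume heal cm_shared info

  # Inspect the AFR changelog xattrs of a suspect file directly on a
  # surviving brick (brick path from my lab; file path is an example).
  getfattr -d -m . -e hex /data/brick_cm_shared/path/to/suspect/file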
Thank you for reading this far and thanks to the community for
making Gluster!!
Example errors:
Example 1:
[2019-09-06 18:26:42.665050] E [MSGID: 108008]
[afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing
ACCESS on gfid ee3f5646-9368-4151-92a3-5b8e7db1fbf9: split-brain observed.
[Input/output error]
Example 2:
[2019-09-06 18:26:55.359272] E [MSGID: 108008]
[afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing
READLINK on gfid f2be38c2-1cd1-486b-acad-17f2321a18b3: split-brain observed.
[Input/output error]
[2019-09-06 18:26:55.359367] W [MSGID: 112199]
[nfs3-helpers.c:3435:nfs3_log_readlink_res] 0-nfs-nfsv3:
/image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32 => (XID: 88651c80,
READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)
The errors seem to happen only on the replicate subvolume that has the
down server (of course, any of the GNFS servers can trigger them when it
serves files from the degraded subvolume).
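When I can get at a system in this state again, I also plan to map the
gfids from nfs.log back to their paths on a surviving brick and compare
the AFR xattrs across the replica set. A rough sketch of what I have in
mind, using my lab brick path and the gfid from the first example above
(for regular files the .glusterfs entry is a hard link to the real file;
symlinks such as the READLINK case are stored differently):

  BRICK=/data/brick_cm_shared
  GFID=ee3f5646-9368-4151-92a3-5b8e7db1fbf9

  # The backend gfid path is .glusterfs/<first 2 hex>/<next 2 hex>/<gfid>
  GPATH=$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID
  ls -l "$GPATH"

  # For a regular file the gfid entry is a hard link; locate the real path
  find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -samefile "$GPATH" -print

  # Compare these xattrs for the same file on each brick in the subvolume
  getfattr -d -m . -e hex "$GPATH"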
Now, I am no longer able to access this customer system, and it is moving
on to more secretive work, so I can't easily run tests on such a big
system until something else comes through the factory. However, I'm
desperate for help and would like a bag of tricks to attack this with the
next time I can hit it. Having the HA stuff fail when it was needed has
given the solution a bit of a black eye. The lesson learned is to fully
exercise the HA solution: I had tested full-system boots many times, but
didn't think to test job launches while degraded. That pain will haunt me
but also make me better.
Info on the volumes:
- RHEL 7.6 x86_64 Gluster/GNFS servers
- Gluster 4.1.6 (a build I set up myself)
- Clients are aarch64 NFSv3 clients, technically mounting the NFS root
  read-only, running a Linux distribution similar to CentOS 7.6
- The brick base filesystems are XFS, with no LVM layer
What follows is the volume info from my test system in the lab, which has
the same versions and setup. I cannot get this info from the customer
without an approval process, but the same scripts and tools set up my test
system, so I'm confident the settings are the same.
[root@leader1 ~]# gluster volume info
Volume Name: cm_shared
Type: Distributed-Replicate
Volume ID: e7f2796b-7a94-41ab-a07d-bdce4900c731
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_cm_shared
Brick2: 172.23.0.4:/data/brick_cm_shared
Brick3: 172.23.0.5:/data/brick_cm_shared
Brick4: 172.23.0.6:/data/brick_cm_shared
Brick5: 172.23.0.7:/data/brick_cm_shared
Brick6: 172.23.0.8:/data/brick_cm_shared
Brick7: 172.23.0.9:/data/brick_cm_shared
Brick8: 172.23.0.10:/data/brick_cm_shared
Brick9: 172.23.0.11:/data/brick_cm_shared
Options Reconfigured:
nfs.nlm: off
nfs.mount-rmtab: /-
performance.nfs.io-cache: on
performance.md-cache-statfs: off
performance.cache-refresh-timeout: 60
storage.max-hardlinks: 0
nfs.acl: on
nfs.outstanding-rpc-limit: 1024
server.outstanding-rpc-limit: 1024
performance.write-behind-window-size: 1024MB
transport.listen-backlog: 16384
performance.write-behind-trickling-writes: off
performance.aggregate-size: 2048KB
performance.flush-behind: on
cluster.lookup-unhashed: auto
performance.parallel-readdir: on
performance.cache-size: 8GB
performance.io-thread-count: 32
network.inode-lru-limit: 1000000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
server.event-threads: 32
client.event-threads: 32
cluster.lookup-optimize: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: false
performance.client-io-threads: on
Volume Name: ctdb
Type: Replicate
Volume ID: 5274a6ce-2ac9-4fc7-8145-dd2b8a97ff3b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 9 = 9
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_ctdb
Brick2: 172.23.0.4:/data/brick_ctdb
Brick3: 172.23.0.5:/data/brick_ctdb
Brick4: 172.23.0.6:/data/brick_ctdb
Brick5: 172.23.0.7:/data/brick_ctdb
Brick6: 172.23.0.8:/data/brick_ctdb
Brick7: 172.23.0.9:/data/brick_ctdb
Brick8: 172.23.0.10:/data/brick_ctdb
Brick9: 172.23.0.11:/data/brick_ctdb
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
Here is the setting detail on the cm_shared volume - the one used for
GNFS:
[root@leader1 ~]# gluster volume get cm_shared all
Option Value
------ -----
cluster.lookup-unhashed auto
cluster.lookup-optimize on
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.force-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal on
cluster.data-self-heal on
cluster.entry-self-heal on
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
disperse.other-eager-lock on
disperse.eager-lock-timeout 1
disperse.other-eager-lock-timeout 1
cluster.quorum-type auto
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.full-lock yes
cluster.stripe-block-size 128KB
cluster.stripe-coalesce true
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 60
performance.cache-priority
performance.cache-size 8GB
performance.io-thread-count 32
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.iot-watchdog-secs (null)
performance.iot-cleanup-disconnected-reqs off
performance.iot-pass-through false
performance.io-cache-pass-through false
performance.cache-size 8GB
performance.qr-cache-timeout 1
performance.cache-invalidation on
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1024MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes off
performance.aggregate-size 2048KB
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open no
performance.open-behind-pass-through false
performance.read-ahead-page-count 4
performance.read-ahead-pass-through false
performance.readdir-ahead-pass-through false
performance.md-cache-pass-through false
performance.md-cache-timeout 600
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
performance.md-cache-statfs off
performance.xattr-cache-list
performance.nl-cache-pass-through false
features.encryption off
encryption.master-key (null)
encryption.data-key-size 256
encryption.block-size 4096
network.frame-timeout 1800
network.ping-timeout 42
network.tcp-window-size (null)
network.remote-dio disable
client.event-threads 32
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 1000000
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure on
server.root-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 1024
server.ssl (null)
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 32
server.tcp-user-timeout 0
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 16384
ssl.own-cert (null)
ssl.private-key (null)
ssl.ca-list (null)
ssl.crl-path (null)
ssl.certificate-depth (null)
ssl.cipher-list (null)
ssl.dh-param (null)
ssl.ec-curve (null)
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.quick-read on
performance.open-behind on
performance.nl-cache off
performance.stat-prefetch on
performance.client-io-threads on
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache on
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
performance.cache-invalidation on
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
features.tag-namespaces off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.enable-ino32 no
nfs.mem-factor 15
nfs.export-dirs on
nfs.export-volumes on
nfs.addr-namelookup off
nfs.dynamic-volumes off
nfs.register-with-portmap on
nfs.outstanding-rpc-limit 1024
nfs.port 2049
nfs.rpc-auth-unix on
nfs.rpc-auth-null on
nfs.rpc-auth-allow all
nfs.rpc-auth-reject none
nfs.ports-insecure off
nfs.trusted-sync off
nfs.trusted-write off
nfs.volume-access read-write
nfs.export-dir
nfs.disable false
nfs.nlm off
nfs.acl on
nfs.mount-udp off
nfs.mount-rmtab /-
nfs.rpc-statd /sbin/rpc.statd
nfs.server-aux-gids off
nfs.drc off
nfs.drc-size 0x20000
nfs.read-size (1 * 1048576ULL)
nfs.write-size (1 * 1048576ULL)
nfs.readdir-size (1 * 1048576ULL)
nfs.rdirplus on
nfs.event-threads 1
nfs.exports-auth-enable (null)
nfs.auth-refresh-interval-sec (null)
nfs.auth-cache-ttl-sec (null)
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.health-check-timeout 10
storage.fips-mode-rchecksum off
storage.force-create-mode 0000
storage.force-directory-mode 0000
storage.create-mask 0777
storage.create-directory-mask 0777
storage.max-hardlinks 0
storage.ctime off
storage.bd-aio off
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 0
changelog.changelog off
changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
cluster.write-freq-threshold 0
cluster.read-freq-threshold 0
cluster.tier-pause off
cluster.tier-promote-frequency 120
cluster.tier-demote-frequency 3600
cluster.watermark-hi 90
cluster.watermark-low 75
cluster.tier-mode cache
cluster.tier-max-promote-file-size 0
cluster.tier-max-mb 4000
cluster.tier-max-files 10000
cluster.tier-query-limit 100
cluster.tier-compact on
cluster.tier-hot-compact-frequency 604800
cluster.tier-cold-compact-frequency 604800
features.ctr-enabled off
features.record-counters off
features.ctr-record-metadata-heat off
features.ctr_link_consistency off
features.ctr_lookupheal_link_timeout 300
features.ctr_lookupheal_inode_timeout 300
features.ctr-sql-db-cachesize 12500
features.ctr-sql-db-wal-autocheckpoint 25000
features.selinux on
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.shard off
features.shard-block-size 64MB
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation on
features.cache-invalidation-timeout 600
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
features.locks-notify-contention no
features.locks-notify-contention-delay 5
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir on
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 60
cluster.brick-multiplex off
cluster.max-bricks-per-process 0
disperse.optimistic-change-log on
disperse.stripe-cache 4
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable
disperse.parallel-writes on
features.sdfs off
features.cloudsync off
features.utime off
Erik