[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Erik Jacobson
erik.jacobson at hpe.com
Sun Mar 29 04:10:49 UTC 2020
Hello all,
I am getting split-brain errors in the gnfs nfs.log when 1 gluster
server is down in a 3-brick/3-node gluster volume. It only happens under
intense load.
I reported this a few months ago but didn't have a repeatable test case.
Since then, we got reports from the field and I was able to make a test case
with 3 gluster servers and 76 NFS clients/compute nodes. I point all 76
nodes to one gnfs server to make the problem more likely to happen with the
limited nodes we have in-house.
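For reference, a minimal sketch of the kind of 1x3 setup described above
(host names and brick paths are illustrative; the volume name cm_shared is
taken from the nfs.log messages below):

  # three leaders, one brick each, replica 3 ("3x1"), served via gluster NFS
  gluster volume create cm_shared replica 3 \
      leader1:/data/brick/cm_shared \
      leader2:/data/brick/cm_shared \
      leader3:/data/brick/cm_shared
  gluster volume set cm_shared nfs.disable off   # enable gnfs
  gluster volume start cm_shared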
We are using gluster nfs (ganesha is not yet reliable for our workload)
to export an NFS filesystem that is used for a read-only root filesystem
for NFS clients. The largest client count we have is 2592 across 9
leaders (3 replicated subvolumes) - out in the field. This is where
the problem was first reported.
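On the clients the shared tree is mounted read-only over NFSv3; an
illustrative mount (not our exact initrd setup) looks like:

  # "head-alias" stands in for the IP alias the compute nodes are pointed at
  mount -t nfs -o ro,nolock,vers=3 head-alias:/cm_shared /mnt/nfsroot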
In the lab, I have a test case that can repeat the problem on a single
subvolume cluster.
Please forgive how ugly the test case is. I'm sure an IO test person can
make it pretty. It basically runs a bunch of cluster-manager NFS-intensive
operations while also producing other load. If one leader is down,
nfs.log reports some split-brain errors. For real-world customers, the
symptom is "some nodes failing to boot" in various ways or "jobs failing
to launch due to permissions or file read problems (like a library not
being readable on one node)". If all leaders are up, we see no errors.
As an attachment, I will include volume settings.
Here are example nfs.log errors:
[2020-03-29 03:42:52.295532] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. [Input/output error]
[2020-03-29 03:42:52.295583] W [MSGID: 112199] [nfs3-helpers.c:3308:nfs3_log_common_res] 0-nfs-nfsv3: <gfid:9e721602-2732-4490-bde3-19cac6e33291>/bin/whoami => (XID: 19fb1558, ACCESS: NFS: 5(I/O error), POSIX: 5(Input/output error))
[2020-03-29 03:43:03.600023] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 77614c4f-1ac4-448d-8fc2-8aedc9b30868: split-brain observed. [Input/output error]
[2020-03-29 03:43:03.600075] W [MSGID: 112199] [nfs3-helpers.c:3308:nfs3_log_common_res] 0-nfs-nfsv3: <gfid:9e721602-2732-4490-bde3-19cac6e33291>/lib64/perl5/vendor_perl/XML/LibXML/Literal.pm => (XID: 9a851abc, ACCESS: NFS: 5(I/O error), POSIX: 5(Input/output error))
[2020-03-29 03:43:07.681294] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing READLINK on gfid 36134289-cb2d-43d9-bd50-60e23d7fa69b: split-brain observed. [Input/output error]
[2020-03-29 03:43:07.681339] W [MSGID: 112199] [nfs3-helpers.c:3327:nfs3_log_readlink_res] 0-nfs-nfsv3: <gfid:9e721602-2732-4490-bde3-19cac6e33291>/lib64/.libhogweed.so.4.hmac => (XID: 5c29744f, READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)
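In case it is useful, this is roughly how I check whether the gfids from
nfs.log are really in split-brain on the bricks (the brick path below is
illustrative):

  # ask AFR which entries it considers split-brain
  gluster volume heal cm_shared info split-brain

  # locate one of the reported gfids on a brick (.glusterfs/<aa>/<bb>/<gfid>)
  GFID=8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1
  ls -l /data/brick/cm_shared/.glusterfs/8e/ed/$GFID

  # compare the AFR changelog xattrs for that object across all three bricks
  getfattr -d -m . -e hex /data/brick/cm_shared/.glusterfs/8e/ed/$GFID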
The brick log isn't very interesting during the failure. There are some
ACL errors that don't seem to directly relate to the issue at hand.
(I can attach if requested!)
This is glusterfs 7.2 (although we originally hit it with 4.1.6).
I'm using RHEL 8 (although field reports are from RHEL 7.6).
If there is anything the community can suggest to help me with this, it
would really be appreciated. I'm getting unhappy reports from the field
that the failover doesn't work as expected.
I've tried tweaking several things, from various threading settings to
enabling md-cache-statfs to mem-factor to listen backlogs. I even tried
adjusting the cluster.read-hash-mode and choose-local settings.
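All of those were applied with gluster volume set, roughly along these
lines (values are illustrative of what was tried; the attached dump shows
where things currently stand):

  gluster volume set cm_shared client.event-threads 32
  gluster volume set cm_shared server.event-threads 32
  gluster volume set cm_shared performance.md-cache-statfs on
  gluster volume set cm_shared nfs.mem-factor 15
  gluster volume set cm_shared transport.listen-backlog 16384
  gluster volume set cm_shared cluster.read-hash-mode 1
  gluster volume set cm_shared cluster.choose-local true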
"cluster-configuration" in the script initiates a bunch of operations on the
node that results in reading many files and doing some database queries. I
used it in my test case as it is a common failure point when nodes are
booting. This test case, although ugly, fails 100% if one server is down and
works 100% if all servers are up.
#! /bin/bash
#
# Test case:
#
# in a 1x3 Gluster Replicated setup with the HPCM volume settings..
#
# On a cluster with 76 nodes (it may be reproducible with fewer; we don't
# know)
#
# When all the nodes are assigned to one IP alias to get the load onto
# one leader node....
#
# This test case will produce split-brain errors in the nfs.log file
# when 1 leader is down, but will run clean when all 3 are up.
#
# It is not necessary to power off the leader you wish to disable. Simply
# running 'systemctl stop glusterd' is sufficient.
#
# We will use this script to try to resolve the issue with split-brain
# under stress when one leader is down.
#
# (compute group is 76 compute nodes)
echo "killing any node find or node tar commands..."
pdsh -f 500 -g compute killall find
pdsh -f 500 -g compute killall tar
# (in this test, leader1 is known to have glusterd stopped for the test case)
echo "stop, start glusterd, drop caches, sleep 15"
set -x
pdsh -w leader2,leader3 systemctl stop glusterd
sleep 3
pdsh -w leader2,leader3 "echo 3 > /proc/sys/vm/drop_caches"
pdsh -w leader2,leader3 systemctl start glusterd
set +x
sleep 15
echo "drop caches on nodes"
pdsh -f 500 -g compute "echo 3 > /proc/sys/vm/drop_caches"
echo "----------------------------------------------------------------------"
echo "test start"
echo "----------------------------------------------------------------------"
set -x
pdsh -f 500 -g compute "tar cf - /usr > /dev/null" &
pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
pdsh -f 500 -g compute "find /usr > /dev/null" &
pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
wait
-------------- next part --------------
Option Value
------ -----
cluster.lookup-unhashed auto
cluster.lookup-optimize on
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.force-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal off
cluster.data-self-heal off
cluster.entry-self-heal off
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
disperse.other-eager-lock on
disperse.eager-lock-timeout 1
disperse.other-eager-lock-timeout 1
cluster.quorum-type auto
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.full-lock yes
cluster.optimistic-change-log on
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 60
performance.cache-priority
performance.cache-size 8GB
performance.io-thread-count 32
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.iot-watchdog-secs (null)
performance.iot-cleanup-disconnected-reqs off
performance.iot-pass-through false
performance.io-cache-pass-through false
performance.cache-size 8GB
performance.qr-cache-timeout 1
performance.cache-invalidation on
performance.ctime-invalidation false
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1024MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes off
performance.aggregate-size 2048KB
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open yes
performance.open-behind-pass-through false
performance.read-ahead-page-count 4
performance.read-ahead-pass-through false
performance.readdir-ahead-pass-through false
performance.md-cache-pass-through false
performance.md-cache-timeout 600
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
performance.md-cache-statfs off
performance.xattr-cache-list
performance.nl-cache-pass-through false
network.frame-timeout 1800
network.ping-timeout 42
network.tcp-window-size (null)
client.ssl off
network.remote-dio disable
client.event-threads 32
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 1000000
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure on
server.root-squash off
server.all-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 1024
server.ssl off
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 32
server.tcp-user-timeout 42
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 16384
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.open-behind on
performance.quick-read on
performance.nl-cache off
performance.stat-prefetch on
performance.client-io-threads on
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache on
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
performance.cache-invalidation on
performance.global-cache-invalidation true
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
features.tag-namespaces off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.enable-ino32 no
nfs.mem-factor 15
nfs.export-dirs on
nfs.export-volumes on
nfs.addr-namelookup off
nfs.dynamic-volumes off
nfs.register-with-portmap on
nfs.outstanding-rpc-limit 1024
nfs.port 2049
nfs.rpc-auth-unix on
nfs.rpc-auth-null on
nfs.rpc-auth-allow all
nfs.rpc-auth-reject none
nfs.ports-insecure off
nfs.trusted-sync off
nfs.trusted-write off
nfs.volume-access read-write
nfs.export-dir
nfs.disable off
nfs.nlm off
nfs.acl on
nfs.mount-udp off
nfs.mount-rmtab /-
nfs.rpc-statd /sbin/rpc.statd
nfs.server-aux-gids off
nfs.drc off
nfs.drc-size 0x20000
nfs.read-size (1 * 1048576ULL)
nfs.write-size (1 * 1048576ULL)
nfs.readdir-size (1 * 1048576ULL)
nfs.rdirplus on
nfs.event-threads 3
nfs.exports-auth-enable on
nfs.auth-refresh-interval-sec 360
nfs.auth-cache-ttl-sec 360
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.reserve-size 0
storage.health-check-timeout 10
storage.fips-mode-rchecksum on
storage.force-create-mode 0000
storage.force-directory-mode 0000
storage.create-mask 0777
storage.create-directory-mask 0777
storage.max-hardlinks 0
features.ctime on
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 51
changelog.changelog off
changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.shard off
features.shard-block-size 64MB
features.shard-lru-limit 16384
features.shard-deletion-rate 100
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation on
features.cache-invalidation-timeout 600
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
features.locks-notify-contention no
features.locks-notify-contention-delay 5
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir on
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 60
cluster.brick-multiplex disable
glusterd.vol_count_per_thread 100
cluster.max-bricks-per-process 250
disperse.optimistic-change-log on
disperse.stripe-cache 4
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
features.selinux on
cluster.daemon-log-level INFO
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable
disperse.parallel-writes on
features.sdfs off
features.cloudsync off
features.ctime on
ctime.noatime on
features.cloudsync-storetype (null)
features.enforce-mandatory-lock off
config.global-threading off
config.client-threads 16
config.brick-threads 16
features.cloudsync-remote-read off
features.cloudsync-store-id (null)
features.cloudsync-product-id (null)