[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Erik Jacobson
erik.jacobson at hpe.com
Sun Mar 29 04:10:49 UTC 2020
Hello all,
I am getting split-brain errors in the gnfs nfs.log when 1 gluster
server is down in a 3-brick/3-node gluster volume. It only happens under
intense load.
I reported this a few months ago but didn't have a repeatable test case.
Since then, we got reports from the field and I was able to make a test case
with 3 gluster servers and 76 NFS clients/compute nodes. I point all 76
nodes to one gnfs server to make the problem more likely to happen with the
limited nodes we have in-house.
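For reference, a minimal sketch of the kind of 1x3 setup described above
(host names and brick paths are illustrative; the volume name cm_shared is
taken from the nfs.log messages below):

  # three leaders, one brick each, replica 3 ("3x1"), served via gluster NFS
  gluster volume create cm_shared replica 3 \
      leader1:/data/brick/cm_shared \
      leader2:/data/brick/cm_shared \
      leader3:/data/brick/cm_shared
  gluster volume set cm_shared nfs.disable off   # enable gnfs
  gluster volume start cm_shared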
We are using gluster nfs (ganesha is not yet reliable for our workload)
to export an NFS filesystem that is used for a read-only root filesystem
for NFS clients. The largest client count we have is 2592 across 9
leaders (3 replicated subvolumes) - out in the field. This is where
the problem was first reported.
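On the clients the shared tree is mounted read-only over NFSv3; an
illustrative mount (not our exact initrd setup) looks like:

  # "head-alias" stands in for the IP alias the compute nodes are pointed at
  mount -t nfs -o ro,nolock,vers=3 head-alias:/cm_shared /mnt/nfsroot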
In the lab, I have a test case that can repeat the problem on a single
subvolume cluster.
Please forgive how ugly the test case is. I'm sure an IO test person can
make it pretty. It basically runs a bunch of cluster-manager NFS-intensive
operations while also producing other load. If one leader is down,
nfs.log reports some split-brain errors. For real-world customers, the
symptom is "some nodes failing to boot" in various ways or "jobs failing
to launch due to permissions or file read problems (like a library not
being readable on one node)". If all leaders are up, we see no errors.
As an attachment, I will include volume settings.
Here are example nfs.log errors:
[2020-03-29 03:42:52.295532] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. [Input/output error]
[2020-03-29 03:42:52.295583] W [MSGID: 112199] [nfs3-helpers.c:3308:nfs3_log_common_res] 0-nfs-nfsv3: <gfid:9e721602-2732-4490-bde3-19cac6e33291>/bin/whoami => (XID: 19fb1558, ACCESS: NFS: 5(I/O error), POSIX: 5(Input/output error))
[2020-03-29 03:43:03.600023] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 77614c4f-1ac4-448d-8fc2-8aedc9b30868: split-brain observed. [Input/output error]
[2020-03-29 03:43:03.600075] W [MSGID: 112199] [nfs3-helpers.c:3308:nfs3_log_common_res] 0-nfs-nfsv3: <gfid:9e721602-2732-4490-bde3-19cac6e33291>/lib64/perl5/vendor_perl/XML/LibXML/Literal.pm => (XID: 9a851abc, ACCESS: NFS: 5(I/O error), POSIX: 5(Input/output error))
[2020-03-29 03:43:07.681294] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing READLINK on gfid 36134289-cb2d-43d9-bd50-60e23d7fa69b: split-brain observed. [Input/output error]
[2020-03-29 03:43:07.681339] W [MSGID: 112199] [nfs3-helpers.c:3327:nfs3_log_readlink_res] 0-nfs-nfsv3: <gfid:9e721602-2732-4490-bde3-19cac6e33291>/lib64/.libhogweed.so.4.hmac => (XID: 5c29744f, READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)
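In case it is useful, this is roughly how I check whether the gfids from
nfs.log are really in split-brain on the bricks (the brick path below is
illustrative):

  # ask AFR which entries it considers split-brain
  gluster volume heal cm_shared info split-brain

  # locate one of the reported gfids on a brick (.glusterfs/<aa>/<bb>/<gfid>)
  GFID=8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1
  ls -l /data/brick/cm_shared/.glusterfs/8e/ed/$GFID

  # compare the AFR changelog xattrs for that object across all three bricks
  getfattr -d -m . -e hex /data/brick/cm_shared/.glusterfs/8e/ed/$GFID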
The brick log isn't very interesting during the failure. There are some
ACL errors that don't seem to directly relate to the issue at hand.
(I can attach if requested!)
This is glusterfs 7.2 (although we originally hit it with 4.1.6).
I'm using RHEL 8 (although field reports are from RHEL 7.6).
If there is anything the community can suggest to help me with this, it
would really be appreciated. I'm getting unhappy reports from the field
that the failover doesn't work as expected.
I've tried tweaking several things, from various threading settings to
enabling md-cache-statfs to mem-factor to listen backlogs. I even tried
adjusting the cluster.read-hash-mode and choose-local settings.
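All of those were applied with gluster volume set, roughly along these
lines (values are illustrative of what was tried; the attached dump shows
where things currently stand):

  gluster volume set cm_shared client.event-threads 32
  gluster volume set cm_shared server.event-threads 32
  gluster volume set cm_shared performance.md-cache-statfs on
  gluster volume set cm_shared nfs.mem-factor 15
  gluster volume set cm_shared transport.listen-backlog 16384
  gluster volume set cm_shared cluster.read-hash-mode 1
  gluster volume set cm_shared cluster.choose-local true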
"cluster-configuration" in the script initiates a bunch of operations on the
node that results in reading many files and doing some database queries. I
used it in my test case as it is a common failure point when nodes are
booting. This test case, although ugly, fails 100% if one server is down and
works 100% if all servers are up.
#! /bin/bash
#
# Test case:
#
# in a 1x3 Gluster Replicated setup with the HPCM volume settings..
#
# On a cluster with 76 nodes (it may be reproducible with fewer; we don't
# know)
#
# When all the nodes are assigned to one IP alias to get the load onto
# one leader node....
#
# This test case will produce split-brain errors in the nfs.log file
# when 1 leader is down, but will run clean when all 3 are up.
#
# It is not necessary to power off the leader you wish to disable. Simply
# running 'systemctl stop glusterd' is sufficient.
#
# We will use this script to try to resolve the issue with split-brain
# under stress when one leader is down.
#
# (compute group is 76 compute nodes)
echo "killing any node find or node tar commands..."
pdsh -f 500 -g compute killall find
pdsh -f 500 -g compute killall tar
# (in this test, leader1 is known to have glusterd stopped for the test case)
echo "stop, start glusterd, drop caches, sleep 15"
set -x
pdsh -w leader2,leader3 systemctl stop glusterd
sleep 3
pdsh -w leader2,leader3 "echo 3 > /proc/sys/vm/drop_caches"
pdsh -w leader2,leader3 systemctl start glusterd
set +x
sleep 15
echo "drop caches on nodes"
pdsh -f 500 -g compute "echo 3 > /proc/sys/vm/drop_caches"
echo "----------------------------------------------------------------------"
echo "test start"
echo "----------------------------------------------------------------------"
set -x
pdsh -f 500 -g compute "tar cf - /usr > /dev/null" &
pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
pdsh -f 500 -g compute "find /usr > /dev/null" &
pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
pdsh -f 500 -g compute /opt/sgi/lib/cluster-configuration
wait
-------------- next part --------------
Option Value
------ -----
cluster.lookup-unhashed auto
cluster.lookup-optimize on
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.force-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal off
cluster.data-self-heal off
cluster.entry-self-heal off
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
disperse.other-eager-lock on
disperse.eager-lock-timeout 1
disperse.other-eager-lock-timeout 1
cluster.quorum-type auto
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.full-lock yes
cluster.optimistic-change-log on
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 60
performance.cache-priority
performance.cache-size 8GB
performance.io-thread-count 32
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.iot-watchdog-secs (null)
performance.iot-cleanup-disconnected-reqs off
performance.iot-pass-through false
performance.io-cache-pass-through false
performance.cache-size 8GB
performance.qr-cache-timeout 1
performance.cache-invalidation on
performance.ctime-invalidation false
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1024MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes off
performance.aggregate-size 2048KB
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open yes
performance.open-behind-pass-through false
performance.read-ahead-page-count 4
performance.read-ahead-pass-through false
performance.readdir-ahead-pass-through false
performance.md-cache-pass-through false
performance.md-cache-timeout 600
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
performance.md-cache-statfs off
performance.xattr-cache-list
performance.nl-cache-pass-through false
network.frame-timeout 1800
network.ping-timeout 42
network.tcp-window-size (null)
client.ssl off
network.remote-dio disable
client.event-threads 32
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 1000000
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure on
server.root-squash off
server.all-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 1024
server.ssl off
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 32
server.tcp-user-timeout 42
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 16384
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.open-behind on
performance.quick-read on
performance.nl-cache off
performance.stat-prefetch on
performance.client-io-threads on
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache on
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
performance.cache-invalidation on
performance.global-cache-invalidation true
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
features.tag-namespaces off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.enable-ino32 no
nfs.mem-factor 15
nfs.export-dirs on
nfs.export-volumes on
nfs.addr-namelookup off
nfs.dynamic-volumes off
nfs.register-with-portmap on
nfs.outstanding-rpc-limit 1024
nfs.port 2049
nfs.rpc-auth-unix on
nfs.rpc-auth-null on
nfs.rpc-auth-allow all
nfs.rpc-auth-reject none
nfs.ports-insecure off
nfs.trusted-sync off
nfs.trusted-write off
nfs.volume-access read-write
nfs.export-dir
nfs.disable off
nfs.nlm off
nfs.acl on
nfs.mount-udp off
nfs.mount-rmtab /-
nfs.rpc-statd /sbin/rpc.statd
nfs.server-aux-gids off
nfs.drc off
nfs.drc-size 0x20000
nfs.read-size (1 * 1048576ULL)
nfs.write-size (1 * 1048576ULL)
nfs.readdir-size (1 * 1048576ULL)
nfs.rdirplus on
nfs.event-threads 3
nfs.exports-auth-enable on
nfs.auth-refresh-interval-sec 360
nfs.auth-cache-ttl-sec 360
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.reserve-size 0
storage.health-check-timeout 10
storage.fips-mode-rchecksum on
storage.force-create-mode 0000
storage.force-directory-mode 0000
storage.create-mask 0777
storage.create-directory-mask 0777
storage.max-hardlinks 0
features.ctime on
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 51
changelog.changelog off
changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.shard off
features.shard-block-size 64MB
features.shard-lru-limit 16384
features.shard-deletion-rate 100
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation on
features.cache-invalidation-timeout 600
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
features.locks-notify-contention no
features.locks-notify-contention-delay 5
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir on
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 60
cluster.brick-multiplex disable
glusterd.vol_count_per_thread 100
cluster.max-bricks-per-process 250
disperse.optimistic-change-log on
disperse.stripe-cache 4
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
features.selinux on
cluster.daemon-log-level INFO
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable
disperse.parallel-writes on
features.sdfs off
features.cloudsync off
features.ctime on
ctime.noatime on
features.cloudsync-storetype (null)
features.enforce-mandatory-lock off
config.global-threading off
config.client-threads 16
config.brick-threads 16
features.cloudsync-remote-read off
features.cloudsync-store-id (null)
features.cloudsync-product-id (null)