[Gluster-users] split-brain errors under heavy load when one brick down
Erik Jacobson
erik.jacobson at hpe.com
Mon Sep 16 14:04:25 UTC 2019
Hello all. I'm new to the list but not to gluster.
We are using gluster to service NFS boot on a top500 cluster. It is a
Distributed-Replicate volume, 3x9 (3 x 3 = 9 bricks, three-way
replication across nine servers).
We are having a problem: when one server in a subvolume goes down, we get
random missing files and split-brain errors in the nfs.log file.
We are using Gluster NFS. (We are interested in switching to Ganesha, but
this workload presents problems there that we still need to work through.)
Unfortunately, as with many such large systems, I am unable to take much
of the system out of production for debugging and cannot take the system
down to test this very often. However, my hope is to be well prepared when
the next large system comes through the factory so I can try to reproduce
this issue, or at least have some things to try.
In the lab, I have a test system that is also a 3x9 setup like the one at
the customer site, but with only 3 compute nodes instead of 2,592. We use
CTDB for IP alias management; the compute nodes connect to NFS through the
alias.
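For reference, a compute node's NFS root mount looks roughly like the
following; the alias address and target directory here are placeholders,
not real values from either system:

  # Illustrative NFSv3 mount of the gluster volume via a CTDB alias IP.
  # The address and mount point are placeholders.
  mount -t nfs -o ro,vers=3 172.23.255.1:/cm_shared /mnt/nfsroot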
Here is the issue we are having:
- 2,592 nodes all PXE-booting at once and using the Gluster servers for
their NFS root works great, including when one subvolume is degraded
due to the loss of a server. No issues at boot, no split-brain messages
in the log.
- The problem comes in when we do an intensive job launch. This launch
uses SLURM and then loads hundreds of shared libraries over NFS across
all 2,592 nodes.
- When all servers in the 3x9 pool are up, we're in good shape - no
issues on the compute nodes, no split-brain messages in the log.
- When one subvolume is missing one server (its Ethernet adapters died),
we still boot fine, but the SLURM launch hits random missing files.
Gluster nfs.log shows split-brain messages and ACCESS I/O errors.
- Taking an example failed file and accessing it across all compute nodes
always works afterwards; the issue is transient.
- The missing file is always found on the other bricks in the subvolume
when we search for it there as well.
- No filesystem or disk I/O errors in the logs or dmesg, and the files are
accessible before and after the transient error (including directly from
the bricks, as noted above).
- So when we are degraded, the customer's jobs fail to launch, with
library read errors, missing config files, and so on.
What is perplexing is that the huge load of 2,592 nodes PXE-booting with
NFS roots does not trigger the issue when one subvolume is degraded.
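When I next get a degraded window, this is the kind of data I plan to
gather first, to confirm whether AFR actually considers anything
split-brained or whether these are transient read failures. The brick
path below is from my lab setup and the file path is only a placeholder:

  # Does the self-heal machinery report any real split-brain entries?
  gluster volume heal cm_shared info split-brain
  gluster volume heal cm_shared info

  # Inspect the AFR changelog xattrs of a suspect file directly on a
  # surviving brick (brick path from my lab; file path is an example).
  getfattr -d -m . -e hex /data/brick_cm_shared/path/to/suspect/file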
Thank you for reading this far and thanks to the community for
making Gluster!!
Example errors:
Example 1:
[2019-09-06 18:26:42.665050] E [MSGID: 108008]
[afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing
ACCESS on gfid ee3f5646-9368-4151-92a3-5b8e7db1fbf9: split-brain observed.
[Input/output error]
Example 2:
[2019-09-06 18:26:55.359272] E [MSGID: 108008]
[afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing
READLINK on gfid f2be38c2-1cd1-486b-acad-17f2321a18b3: split-brain observed.
[Input/output error]
[2019-09-06 18:26:55.359367] W [MSGID: 112199]
[nfs3-helpers.c:3435:nfs3_log_readlink_res] 0-nfs-nfsv3:
/image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32 => (XID: 88651c80,
READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)
The errors seem to happen only on the replicate subvolume that has the
down server (of course, any of the GNFS servers can trigger them when it
serves files from the degraded subvolume).
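When I can get at a system in this state again, I also plan to map the
gfids from nfs.log back to their paths on a surviving brick and compare
the AFR xattrs across the replica set. A rough sketch of what I have in
mind, using my lab brick path and the gfid from the first example above
(for regular files the .glusterfs entry is a hard link to the real file;
symlinks such as the READLINK case are stored differently):

  BRICK=/data/brick_cm_shared
  GFID=ee3f5646-9368-4151-92a3-5b8e7db1fbf9

  # The backend gfid path is .glusterfs/<first 2 hex>/<next 2 hex>/<gfid>
  GPATH=$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID
  ls -l "$GPATH"

  # For a regular file the gfid entry is a hard link; locate the real path
  find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -samefile "$GPATH" -print

  # Compare these xattrs for the same file on each brick in the subvolume
  getfattr -d -m . -e hex "$GPATH"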
Now, I am no longer able to access this customer system, and it is moving
on to more secretive work, so I can't easily run tests on such a big
system until something else comes through the factory. However, I'm
desperate for help and would like a bag of tricks to attack this with the
next time I can hit it. Having the HA stuff fail when it was needed has
given the solution a bit of a black eye. The lesson learned is to fully
exercise the HA solution: I had tested full-system boots many times, but
didn't think to test job launches while degraded. That pain will haunt me
but also make me better.
Info on the volumes:
- RHEL 7.6 x86_64 Gluster/GNFS servers
- Gluster 4.1.6 (a build I set up myself)
- Clients are aarch64 NFSv3 clients, technically mounting the NFS root
  read-only, running a Linux distribution similar to CentOS 7.6
- The brick base filesystems are XFS, with no LVM layer
What follows is the volume info from my test system in the lab, which has
the same versions and setup. I cannot get this info from the customer
without an approval process, but the same scripts and tools set up my test
system, so I'm confident the settings are the same.
[root@leader1 ~]# gluster volume info
Volume Name: cm_shared
Type: Distributed-Replicate
Volume ID: e7f2796b-7a94-41ab-a07d-bdce4900c731
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_cm_shared
Brick2: 172.23.0.4:/data/brick_cm_shared
Brick3: 172.23.0.5:/data/brick_cm_shared
Brick4: 172.23.0.6:/data/brick_cm_shared
Brick5: 172.23.0.7:/data/brick_cm_shared
Brick6: 172.23.0.8:/data/brick_cm_shared
Brick7: 172.23.0.9:/data/brick_cm_shared
Brick8: 172.23.0.10:/data/brick_cm_shared
Brick9: 172.23.0.11:/data/brick_cm_shared
Options Reconfigured:
nfs.nlm: off
nfs.mount-rmtab: /-
performance.nfs.io-cache: on
performance.md-cache-statfs: off
performance.cache-refresh-timeout: 60
storage.max-hardlinks: 0
nfs.acl: on
nfs.outstanding-rpc-limit: 1024
server.outstanding-rpc-limit: 1024
performance.write-behind-window-size: 1024MB
transport.listen-backlog: 16384
performance.write-behind-trickling-writes: off
performance.aggregate-size: 2048KB
performance.flush-behind: on
cluster.lookup-unhashed: auto
performance.parallel-readdir: on
performance.cache-size: 8GB
performance.io-thread-count: 32
network.inode-lru-limit: 1000000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
server.event-threads: 32
client.event-threads: 32
cluster.lookup-optimize: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: false
performance.client-io-threads: on
Volume Name: ctdb
Type: Replicate
Volume ID: 5274a6ce-2ac9-4fc7-8145-dd2b8a97ff3b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 9 = 9
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_ctdb
Brick2: 172.23.0.4:/data/brick_ctdb
Brick3: 172.23.0.5:/data/brick_ctdb
Brick4: 172.23.0.6:/data/brick_ctdb
Brick5: 172.23.0.7:/data/brick_ctdb
Brick6: 172.23.0.8:/data/brick_ctdb
Brick7: 172.23.0.9:/data/brick_ctdb
Brick8: 172.23.0.10:/data/brick_ctdb
Brick9: 172.23.0.11:/data/brick_ctdb
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
Here is the setting detail on the cm_shared volume - the one used for
GNFS:
[root@leader1 ~]# gluster volume get cm_shared all
Option Value
------ -----
cluster.lookup-unhashed auto
cluster.lookup-optimize on
cluster.min-free-disk 10%
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.force-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal on
cluster.data-self-heal on
cluster.entry-self-heal on
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
disperse.other-eager-lock on
disperse.eager-lock-timeout 1
disperse.other-eager-lock-timeout 1
cluster.quorum-type auto
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.full-lock yes
cluster.stripe-block-size 128KB
cluster.stripe-coalesce true
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level INFO
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 60
performance.cache-priority
performance.cache-size 8GB
performance.io-thread-count 32
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.iot-watchdog-secs (null)
performance.iot-cleanup-disconnected-reqs off
performance.iot-pass-through false
performance.io-cache-pass-through false
performance.cache-size 8GB
performance.qr-cache-timeout 1
performance.cache-invalidation on
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 1024MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes off
performance.aggregate-size 2048KB
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open no
performance.open-behind-pass-through false
performance.read-ahead-page-count 4
performance.read-ahead-pass-through false
performance.readdir-ahead-pass-through false
performance.md-cache-pass-through false
performance.md-cache-timeout 600
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
performance.md-cache-statfs off
performance.xattr-cache-list
performance.nl-cache-pass-through false
features.encryption off
encryption.master-key (null)
encryption.data-key-size 256
encryption.block-size 4096
network.frame-timeout 1800
network.ping-timeout 42
network.tcp-window-size (null)
network.remote-dio disable
client.event-threads 32
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 1000000
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure on
server.root-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 1024
server.ssl (null)
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 32
server.tcp-user-timeout 0
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 16384
ssl.own-cert (null)
ssl.private-key (null)
ssl.ca-list (null)
ssl.crl-path (null)
ssl.certificate-depth (null)
ssl.cipher-list (null)
ssl.dh-param (null)
ssl.ec-curve (null)
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.quick-read on
performance.open-behind on
performance.nl-cache off
performance.stat-prefetch on
performance.client-io-threads on
performance.nfs.write-behind on
performance.nfs.read-ahead off
performance.nfs.io-cache on
performance.nfs.quick-read off
performance.nfs.stat-prefetch off
performance.nfs.io-threads off
performance.force-readdirp true
performance.cache-invalidation on
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
features.tag-namespaces off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing off
geo-replication.indexing off
geo-replication.ignore-pid-check off
geo-replication.ignore-pid-check off
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.enable-ino32 no
nfs.mem-factor 15
nfs.export-dirs on
nfs.export-volumes on
nfs.addr-namelookup off
nfs.dynamic-volumes off
nfs.register-with-portmap on
nfs.outstanding-rpc-limit 1024
nfs.port 2049
nfs.rpc-auth-unix on
nfs.rpc-auth-null on
nfs.rpc-auth-allow all
nfs.rpc-auth-reject none
nfs.ports-insecure off
nfs.trusted-sync off
nfs.trusted-write off
nfs.volume-access read-write
nfs.export-dir
nfs.disable false
nfs.nlm off
nfs.acl on
nfs.mount-udp off
nfs.mount-rmtab /-
nfs.rpc-statd /sbin/rpc.statd
nfs.server-aux-gids off
nfs.drc off
nfs.drc-size 0x20000
nfs.read-size (1 * 1048576ULL)
nfs.write-size (1 * 1048576ULL)
nfs.readdir-size (1 * 1048576ULL)
nfs.rdirplus on
nfs.event-threads 1
nfs.exports-auth-enable (null)
nfs.auth-refresh-interval-sec (null)
nfs.auth-cache-ttl-sec (null)
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.health-check-timeout 10
storage.fips-mode-rchecksum off
storage.force-create-mode 0000
storage.force-directory-mode 0000
storage.create-mask 0777
storage.create-directory-mask 0777
storage.max-hardlinks 0
storage.ctime off
storage.bd-aio off
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 0
changelog.changelog off
changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5MB
features.trash-internal-op off
cluster.enable-shared-storage disable
cluster.write-freq-threshold 0
cluster.read-freq-threshold 0
cluster.tier-pause off
cluster.tier-promote-frequency 120
cluster.tier-demote-frequency 3600
cluster.watermark-hi 90
cluster.watermark-low 75
cluster.tier-mode cache
cluster.tier-max-promote-file-size 0
cluster.tier-max-mb 4000
cluster.tier-max-files 10000
cluster.tier-query-limit 100
cluster.tier-compact on
cluster.tier-hot-compact-frequency 604800
cluster.tier-cold-compact-frequency 604800
features.ctr-enabled off
features.record-counters off
features.ctr-record-metadata-heat off
features.ctr_link_consistency off
features.ctr_lookupheal_link_timeout 300
features.ctr_lookupheal_inode_timeout 300
features.ctr-sql-db-cachesize 12500
features.ctr-sql-db-wal-autocheckpoint 25000
features.selinux on
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.shard off
features.shard-block-size 64MB
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation on
features.cache-invalidation-timeout 600
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
features.locks-notify-contention no
features.locks-notify-contention-delay 5
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir on
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 60
cluster.brick-multiplex off
cluster.max-bricks-per-process 0
disperse.optimistic-change-log on
disperse.stripe-cache 4
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable
disperse.parallel-writes on
features.sdfs off
features.cloudsync off
features.utime off
Erik