[Gluster-users] One of cluster work super slow (v6.8)

Strahil Nikolov hunter86_bg at yahoo.com
Tue Jun 23 17:42:50 UTC 2020


What is the OS and its version?
I have seen similar behaviour (different workload) on RHEL 7.6 (and below).
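For example, this should show the distribution and the kernel on st2a:

cat /etc/os-release   # or /etc/redhat-release on older EL systems
uname -r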

Have you checked which processes are in 'R' or 'D' state on st2a?
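For example, something like this (assuming the usual procps `ps`) should list
them, along with the kernel function a 'D' thread is sleeping in:

ps -eo state,pid,wchan:30,comm | awk '$1 ~ /^[RD]/'

Many threads stuck in 'D' usually point at slow storage, while mostly 'R'
threads suggest CPU-bound work inside the brick process.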

Best Regards,
Strahil Nikolov

On 23 June 2020 at 19:31:12 GMT+03:00, Pavel Znamensky <kompastver at gmail.com> wrote:
>Hi all,
>There's something strange with one of our clusters running glusterfs
>version 6.8: it's quite slow and one node is overloaded.
>This is a distributed cluster with four servers with the same
>specs/OS/versions:
>
>Volume Name: st2
>Type: Distributed-Replicate
>Volume ID: 4755753b-37c4-403b-b1c8-93099bfc4c45
>Status: Started
>Snapshot Count: 0
>Number of Bricks: 2 x 2 = 4
>Transport-type: tcp
>Bricks:
>Brick1: st2a:/vol3/st2
>Brick2: st2b:/vol3/st2
>Brick3: st2c:/vol3/st2
>Brick4: st2d:/vol3/st2
>Options Reconfigured:
>cluster.rebal-throttle: aggressive
>nfs.disable: on
>performance.readdir-ahead: off
>transport.address-family: inet6
>performance.quick-read: off
>performance.cache-size: 1GB
>performance.io-cache: on
>performance.io-thread-count: 16
>cluster.data-self-heal-algorithm: full
>network.ping-timeout: 20
>server.event-threads: 2
>client.event-threads: 2
>cluster.readdir-optimize: on
>performance.read-ahead: off
>performance.parallel-readdir: on
>cluster.self-heal-daemon: enable
>storage.health-check-timeout: 20
>
>The op.version for this cluster remains at 50400.
>
>st2a is a replica of st2b, and st2c is a replica of st2d.
>All our 50 clients mount this volume using FUSE, and in contrast with
>our other cluster, this one works terribly slowly.
>The interesting thing here is that HDD and network utilization are
>very low on the one hand, while one server is quite overloaded on the
>other.
>Also, there are no files that need healing according to `gluster
>volume heal st2 info`.
>Load average across servers:
>st2a:
>load average: 28,73, 26,39, 27,44
>st2b:
>load average: 0,24, 0,46, 0,76
>st2c:
>load average: 0,13, 0,20, 0,27
>st2d:
>load average: 2,93, 2,11, 1,50
>
>If we stop glusterfs on the st2a server, the cluster works as fast as
>we expect.
>Previously the cluster ran version 5.x and there were no such
>problems.
>
>Interestingly, almost all CPU usage on st2a is generated by "system"
>load.
>The most CPU-intensive process is glusterfsd.
>`top -H` for the glusterfsd process shows this:
>
>PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
>
>13894 root      20   0 2172892  96488   9056 R 74,0  0,1 122:09.14 glfs_iotwr00a
>13888 root      20   0 2172892  96488   9056 R 73,7  0,1 121:38.26 glfs_iotwr004
>13891 root      20   0 2172892  96488   9056 R 73,7  0,1 121:53.83 glfs_iotwr007
>13920 root      20   0 2172892  96488   9056 R 73,0  0,1 122:11.27 glfs_iotwr00f
>13897 root      20   0 2172892  96488   9056 R 68,3  0,1 121:09.82 glfs_iotwr00d
>13896 root      20   0 2172892  96488   9056 R 68,0  0,1 122:03.99 glfs_iotwr00c
>13868 root      20   0 2172892  96488   9056 R 67,7  0,1 122:42.55 glfs_iotwr000
>13889 root      20   0 2172892  96488   9056 R 67,3  0,1 122:17.02 glfs_iotwr005
>13887 root      20   0 2172892  96488   9056 R 67,0  0,1 122:29.88 glfs_iotwr003
>13885 root      20   0 2172892  96488   9056 R 65,0  0,1 122:04.85 glfs_iotwr001
>13892 root      20   0 2172892  96488   9056 R 55,0  0,1 121:15.23 glfs_iotwr008
>13890 root      20   0 2172892  96488   9056 R 54,7  0,1 121:27.88 glfs_iotwr006
>13895 root      20   0 2172892  96488   9056 R 54,0  0,1 121:28.35 glfs_iotwr00b
>13893 root      20   0 2172892  96488   9056 R 53,0  0,1 122:23.12 glfs_iotwr009
>13898 root      20   0 2172892  96488   9056 R 52,0  0,1 122:30.67 glfs_iotwr00e
>13886 root      20   0 2172892  96488   9056 R 41,3  0,1 121:26.97 glfs_iotwr002
>13878 root      20   0 2172892  96488   9056 S  1,0  0,1   1:20.34 glfs_rpcrqhnd
>13840 root      20   0 2172892  96488   9056 S  0,7  0,1   0:51.54 glfs_epoll000
>13841 root      20   0 2172892  96488   9056 S  0,7  0,1   0:51.14 glfs_epoll001
>13877 root      20   0 2172892  96488   9056 S  0,3  0,1   1:20.02 glfs_rpcrqhnd
>13833 root      20   0 2172892  96488   9056 S  0,0  0,1   0:00.00 glusterfsd
>13834 root      20   0 2172892  96488   9056 S  0,0  0,1   0:00.14 glfs_timer
>13835 root      20   0 2172892  96488   9056 S  0,0  0,1   0:00.00 glfs_sigwait
>13836 root      20   0 2172892  96488   9056 S  0,0  0,1   0:00.16 glfs_memsweep
>13837 root      20   0 2172892  96488   9056 S  0,0  0,1   0:00.05 glfs_sproc0
>
>Also, I didn't find any relevant messages in the log files.
>Honestly, I don't know what to do. Does someone know how to debug or
>fix this behaviour?
>
>Best regards,
>Pavel

