[Gluster-users] One of cluster work super slow (v6.8)

Pavel Znamensky kompastver at gmail.com
Tue Jun 23 16:31:12 UTC 2020


Hi all,
There's something strange going on with one of our clusters running GlusterFS
6.8: it's quite slow and one node is overloaded.
It's a distributed-replicated cluster of four servers with the same
specs/OS/versions:

Volume Name: st2
Type: Distributed-Replicate
Volume ID: 4755753b-37c4-403b-b1c8-93099bfc4c45
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: st2a:/vol3/st2
Brick2: st2b:/vol3/st2
Brick3: st2c:/vol3/st2
Brick4: st2d:/vol3/st2
Options Reconfigured:
cluster.rebal-throttle: aggressive
nfs.disable: on
performance.readdir-ahead: off
transport.address-family: inet6
performance.quick-read: off
performance.cache-size: 1GB
performance.io-cache: on
performance.io-thread-count: 16
cluster.data-self-heal-algorithm: full
network.ping-timeout: 20
server.event-threads: 2
client.event-threads: 2
cluster.readdir-optimize: on
performance.read-ahead: off
performance.parallel-readdir: on
cluster.self-heal-daemon: enable
storage.health-check-timeout: 20

The cluster op-version is still at 50400.
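For reference, this is roughly how we check it (the target value 60000 for the
6.x level is my assumption; we haven't bumped it yet):

  # current cluster-wide op-version
  gluster volume get all cluster.op-version
  # highest op-version the installed binaries support
  gluster volume get all cluster.max-op-version
  # bumping it would presumably be:
  # gluster volume set all cluster.op-version 60000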

st2a is a replica of st2b, and st2c is a replica of st2d.
All 50 of our clients mount this volume using FUSE, and in contrast to our
other cluster, this one is terribly slow.
The interesting thing is that HDD and network utilization is very low on the
one hand, while one server is quite overloaded on the other.
Also, there are no files pending heal according to `gluster volume heal st2
info`.
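For what it's worth, disk and network utilization was compared roughly like
this on each node (the 5-second interval is arbitrary):

  # per-disk utilization, see the %util column
  iostat -x 5
  # per-interface network throughput
  sar -n DEV 5
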
Load average across servers:
st2a:
load average: 28.73, 26.39, 27.44
st2b:
load average: 0.24, 0.46, 0.76
st2c:
load average: 0.13, 0.20, 0.27
st2d:
load average: 2.93, 2.11, 1.50

If we stop glusterfs on the st2a server, the cluster works as fast as we
expect.
Previously the cluster ran version 5.x and there were no such problems.
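
By "stop glusterfs on st2a" I mean stopping the management daemon and killing
the gluster processes on that node, roughly like this (the exact service name
and process matching may differ per setup):

  systemctl stop glusterd
  pkill glusterfsd
  pkill -f glusterfs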

Interestingly, almost all CPU usage on st2a is generated by "system" load.
The most CPU-intensive process is glusterfsd.
`top -H` for the glusterfsd process shows this:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
13894 root      20   0 2172892  96488   9056 R 74.0  0.1 122:09.14 glfs_iotwr00a
13888 root      20   0 2172892  96488   9056 R 73.7  0.1 121:38.26 glfs_iotwr004
13891 root      20   0 2172892  96488   9056 R 73.7  0.1 121:53.83 glfs_iotwr007
13920 root      20   0 2172892  96488   9056 R 73.0  0.1 122:11.27 glfs_iotwr00f
13897 root      20   0 2172892  96488   9056 R 68.3  0.1 121:09.82 glfs_iotwr00d
13896 root      20   0 2172892  96488   9056 R 68.0  0.1 122:03.99 glfs_iotwr00c
13868 root      20   0 2172892  96488   9056 R 67.7  0.1 122:42.55 glfs_iotwr000
13889 root      20   0 2172892  96488   9056 R 67.3  0.1 122:17.02 glfs_iotwr005
13887 root      20   0 2172892  96488   9056 R 67.0  0.1 122:29.88 glfs_iotwr003
13885 root      20   0 2172892  96488   9056 R 65.0  0.1 122:04.85 glfs_iotwr001
13892 root      20   0 2172892  96488   9056 R 55.0  0.1 121:15.23 glfs_iotwr008
13890 root      20   0 2172892  96488   9056 R 54.7  0.1 121:27.88 glfs_iotwr006
13895 root      20   0 2172892  96488   9056 R 54.0  0.1 121:28.35 glfs_iotwr00b
13893 root      20   0 2172892  96488   9056 R 53.0  0.1 122:23.12 glfs_iotwr009
13898 root      20   0 2172892  96488   9056 R 52.0  0.1 122:30.67 glfs_iotwr00e
13886 root      20   0 2172892  96488   9056 R 41.3  0.1 121:26.97 glfs_iotwr002
13878 root      20   0 2172892  96488   9056 S  1.0  0.1   1:20.34 glfs_rpcrqhnd
13840 root      20   0 2172892  96488   9056 S  0.7  0.1   0:51.54 glfs_epoll000
13841 root      20   0 2172892  96488   9056 S  0.7  0.1   0:51.14 glfs_epoll001
13877 root      20   0 2172892  96488   9056 S  0.3  0.1   1:20.02 glfs_rpcrqhnd
13833 root      20   0 2172892  96488   9056 S  0.0  0.1   0:00.00 glusterfsd
13834 root      20   0 2172892  96488   9056 S  0.0  0.1   0:00.14 glfs_timer
13835 root      20   0 2172892  96488   9056 S  0.0  0.1   0:00.00 glfs_sigwait
13836 root      20   0 2172892  96488   9056 S  0.0  0.1   0:00.16 glfs_memsweep
13837 root      20   0 2172892  96488   9056 S  0.0  0.1   0:00.05 glfs_sproc0
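
For completeness, the per-thread snapshot above was captured with something
like this (the pgrep pattern is just one way to pick out the brick pid):

  top -H -p "$(pgrep -f 'glusterfsd.*vol3/st2' | head -n 1)"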

Also, I didn't find any relevant messages in the log files.
Honestly, I don't know what to do. Does anyone know how to debug or fix this
behaviour?

Best regards,
Pavel