[Gluster-users] One of cluster work super slow (v6.8)
Pavel Znamensky
kompastver at gmail.com
Thu Jul 16 08:19:45 UTC 2020
Sorry for the delay. Somehow Gmail decided to put almost all email from
this list into spam.
Anyway, yes, I checked the processes. The Gluster processes are in the 'R'
state, the others are in the 'S' state.
You can find 'top -H' output in the first message.
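
For reference, the per-thread state check can be done with something like
this (just a sketch; assumes procps-ng ps, adjust columns as needed):

    # print state, pid, tid and thread name for every thread,
    # keeping only running (R) and uninterruptible (D) ones
    ps -eLo state,pid,tid,comm | awk '$1 ~ /^(R|D)/'
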
We're running glusterfs 6.8 on CentOS 7.8. Linux kernel 4.19.
Thanks.
Tue, 23 Jun 2020 at 21:49, Strahil Nikolov <hunter86_bg at yahoo.com>:
> What is the OS and its version?
> I have seen similar behaviour (different workload) on RHEL 7.6 (and
> below).
>
> Have you checked what processes are in 'R' or 'D' state on st2a ?
>
> Best Regards,
> Strahil Nikolov
>
> On 23 June 2020 19:31:12 GMT+03:00, Pavel Znamensky <
> kompastver at gmail.com> wrote:
> >Hi all,
> >There's something strange with one of our clusters running glusterfs
> >version 6.8: it's quite slow and one node is overloaded.
> >This is a distributed cluster with four servers with the same
> >specs/OS/versions:
> >
> >Volume Name: st2
> >Type: Distributed-Replicate
> >Volume ID: 4755753b-37c4-403b-b1c8-93099bfc4c45
> >Status: Started
> >Snapshot Count: 0
> >Number of Bricks: 2 x 2 = 4
> >Transport-type: tcp
> >Bricks:
> >Brick1: st2a:/vol3/st2
> >Brick2: st2b:/vol3/st2
> >Brick3: st2c:/vol3/st2
> >Brick4: st2d:/vol3/st2
> >Options Reconfigured:
> >cluster.rebal-throttle: aggressive
> >nfs.disable: on
> >performance.readdir-ahead: off
> >transport.address-family: inet6
> >performance.quick-read: off
> >performance.cache-size: 1GB
> >performance.io-cache: on
> >performance.io-thread-count: 16
> >cluster.data-self-heal-algorithm: full
> >network.ping-timeout: 20
> >server.event-threads: 2
> >client.event-threads: 2
> >cluster.readdir-optimize: on
> >performance.read-ahead: off
> >performance.parallel-readdir: on
> >cluster.self-heal-daemon: enable
> >storage.health-check-timeout: 20
> >
> >op.version for this cluster remains 50400
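> >
> >For reference, those values can be re-checked with the standard gluster
> >CLI, e.g. (a quick sketch):
> >
> >    # effective volume options and cluster op-version
> >    gluster volume get st2 all
> >    gluster volume get all cluster.op-version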
> >
> >st2a is a replica of st2b, and st2c is a replica of st2d.
> >All our 50 clients mount this volume using FUSE, and in contrast with our
> >other cluster, this one works terribly slowly.
> >The interesting thing here is that, on the one hand, HDD and network
> >utilization are very low, while on the other hand the server is quite
> >overloaded.
> >Also, there are no files that need healing according to `gluster
> >volume heal st2 info`.
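> >
> >That check is just the standard heal CLI; for a condensed view there is
> >also a summary form (a sketch; the summary form should be available in 6.x):
> >
> >    gluster volume heal st2 info
> >    gluster volume heal st2 info summary
> >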
> >Load average across servers:
> >st2a:
> >load average: 28.73, 26.39, 27.44
> >st2b:
> >load average: 0.24, 0.46, 0.76
> >st2c:
> >load average: 0.13, 0.20, 0.27
> >st2d:
> >load average: 2.93, 2.11, 1.50
> >
> >If we stop glusterfs on the st2a server, the cluster works as fast as we
> >expect.
> >Previously the cluster ran version 5.x and there were no such
> >problems.
> >
> >Interestingly, almost all of the CPU usage on st2a shows up as "system"
> >load.
> >The most CPU-intensive process is glusterfsd.
> >`top -H` for the glusterfsd process shows this:
> >
> >  PID USER     PR  NI    VIRT   RES  SHR S %CPU %MEM     TIME+ COMMAND
> >13894 root     20   0 2172892 96488 9056 R 74.0  0.1 122:09.14 glfs_iotwr00a
> >13888 root     20   0 2172892 96488 9056 R 73.7  0.1 121:38.26 glfs_iotwr004
> >13891 root     20   0 2172892 96488 9056 R 73.7  0.1 121:53.83 glfs_iotwr007
> >13920 root     20   0 2172892 96488 9056 R 73.0  0.1 122:11.27 glfs_iotwr00f
> >13897 root     20   0 2172892 96488 9056 R 68.3  0.1 121:09.82 glfs_iotwr00d
> >13896 root     20   0 2172892 96488 9056 R 68.0  0.1 122:03.99 glfs_iotwr00c
> >13868 root     20   0 2172892 96488 9056 R 67.7  0.1 122:42.55 glfs_iotwr000
> >13889 root     20   0 2172892 96488 9056 R 67.3  0.1 122:17.02 glfs_iotwr005
> >13887 root     20   0 2172892 96488 9056 R 67.0  0.1 122:29.88 glfs_iotwr003
> >13885 root     20   0 2172892 96488 9056 R 65.0  0.1 122:04.85 glfs_iotwr001
> >13892 root     20   0 2172892 96488 9056 R 55.0  0.1 121:15.23 glfs_iotwr008
> >13890 root     20   0 2172892 96488 9056 R 54.7  0.1 121:27.88 glfs_iotwr006
> >13895 root     20   0 2172892 96488 9056 R 54.0  0.1 121:28.35 glfs_iotwr00b
> >13893 root     20   0 2172892 96488 9056 R 53.0  0.1 122:23.12 glfs_iotwr009
> >13898 root     20   0 2172892 96488 9056 R 52.0  0.1 122:30.67 glfs_iotwr00e
> >13886 root     20   0 2172892 96488 9056 R 41.3  0.1 121:26.97 glfs_iotwr002
> >13878 root     20   0 2172892 96488 9056 S  1.0  0.1   1:20.34 glfs_rpcrqhnd
> >13840 root     20   0 2172892 96488 9056 S  0.7  0.1   0:51.54 glfs_epoll000
> >13841 root     20   0 2172892 96488 9056 S  0.7  0.1   0:51.14 glfs_epoll001
> >13877 root     20   0 2172892 96488 9056 S  0.3  0.1   1:20.02 glfs_rpcrqhnd
> >13833 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.00 glusterfsd
> >13834 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.14 glfs_timer
> >13835 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.00 glfs_sigwait
> >13836 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.16 glfs_memsweep
> >13837 root     20   0 2172892 96488 9056 S  0.0  0.1   0:00.05 glfs_sproc0
> >
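> >Since all the busy threads are io-threads, one way to see where that
> >"system" time goes might be perf and/or the built-in profiler (just a
> >sketch, assuming perf is installed and briefly enabling profiling is
> >acceptable; 13833 is the main glusterfsd thread in the listing above):
> >
> >    # sample where the brick process burns CPU
> >    perf top -p 13833
> >
> >    # per-brick FOP latency statistics from gluster itself
> >    gluster volume profile st2 start
> >    gluster volume profile st2 info
> >    gluster volume profile st2 stop
> >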
> >Also, I didn't find any relevant messages in the log files.
> >Honestly, I don't know what to do. Does anyone know how to debug or fix
> >this behaviour?
> >
> >Best regards,
> >Pavel
>