<div dir="ltr">Hi all,<div>Something strange is happening on one of our clusters running glusterfs 6.8: it&#39;s quite slow and one node is overloaded.</div><div>This is a distributed cluster of four servers with the same specs, OS and versions:</div><div><br></div><div><font face="monospace">Volume Name: st2<br></font></div><div><font face="monospace">Type: Distributed-Replicate<br>Volume ID: 4755753b-37c4-403b-b1c8-93099bfc4c45<br>Status: Started<br>Snapshot Count: 0<br>Number of Bricks: 2 x 2 = 4<br>Transport-type: tcp<br>Bricks:<br>Brick1: st2a:/vol3/st2<br>Brick2: st2b:/vol3/st2<br>Brick3: st2c:/vol3/st2<br>Brick4: st2d:/vol3/st2<br>Options Reconfigured:<br>cluster.rebal-throttle: aggressive<br>nfs.disable: on<br>performance.readdir-ahead: off<br>transport.address-family: inet6<br>performance.quick-read: off<br>performance.cache-size: 1GB<br>performance.io-cache: on<br>performance.io-thread-count: 16<br>cluster.data-self-heal-algorithm: full<br>network.ping-timeout: 20<br>server.event-threads: 2<br>client.event-threads: 2<br>cluster.readdir-optimize: on<br>performance.read-ahead: off<br>performance.parallel-readdir: on<br>cluster.self-heal-daemon: enable<br>storage.health-check-timeout: 20</font></div><div><font face="monospace"><br></font></div>The op.version for this cluster remains 50400.<div><br></div>st2a is a replica of st2b, and st2c is a replica of st2d.<div>All 50 of our clients mount this volume using FUSE, and in contrast with our other cluster, this one is terribly slow.</div><div>The interesting thing is that HDD and network utilization are very low on the one hand, while one server is heavily overloaded on the other.</div><div>Also, according to `gluster volume heal st2 info`, there are no files that need healing.</div><div>Load average across servers:</div><div>st2a:<br>load average: 28,73, 26,39, 27,44<br>st2b:<br>load average: 0,24, 0,46, 0,76<br>st2c:<br>load average: 0,13, 0,20, 0,27<br>st2d:<br>load average: 2,93, 2,11, 
1,50<br></div><div><br></div><div>If we stop glusterfs on the st2a server, the cluster works as fast as we expect.</div><div>Previously the cluster ran version 5.x and there were no such problems.</div><div><br></div><div>Interestingly, almost all of the CPU usage on st2a is &quot;system&quot; load.</div><div>The most CPU-intensive process is glusterfsd.</div><div>`top -H` for the glusterfsd process shows this:</div><div><br></div><div><font face="monospace">PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                        <br>13894 root      20   0 2172892  96488   9056 R 74,0  0,1 122:09.14 glfs_iotwr00a <br>13888 root      20   0 2172892  96488   9056 R 73,7  0,1 121:38.26 glfs_iotwr004 <br>13891 root      20   0 2172892  96488   9056 R 73,7  0,1 121:53.83 glfs_iotwr007 <br>13920 root      20   0 2172892  96488   9056 R 73,0  0,1 122:11.27 glfs_iotwr00f <br>13897 root      20   0 2172892  96488   9056 R 68,3  0,1 121:09.82 glfs_iotwr00d <br>13896 root      20   0 2172892  96488   9056 R 68,0  0,1 122:03.99 glfs_iotwr00c <br>13868 root      20   0 2172892  96488   9056 R 67,7  0,1 122:42.55 glfs_iotwr000 <br>13889 root      20   0 2172892  96488   9056 R 67,3  0,1 122:17.02 glfs_iotwr005 <br>13887 root      20   0 2172892  96488   9056 R 67,0  0,1 122:29.88 glfs_iotwr003 <br>13885 root      20   0 2172892  96488   9056 R 65,0  0,1 122:04.85 glfs_iotwr001 <br>13892 root      20   0 2172892  96488   9056 R 55,0  0,1 121:15.23 glfs_iotwr008 <br>13890 root      20   0 2172892  96488   9056 R 54,7  0,1 121:27.88 glfs_iotwr006 <br>13895 root      20   0 2172892  96488   9056 R 54,0  0,1 121:28.35 glfs_iotwr00b <br>13893 root      20   0 2172892  96488   9056 R 53,0  0,1 122:23.12 glfs_iotwr009 <br>13898 root      20   0 2172892  96488   9056 R 52,0  0,1 122:30.67 glfs_iotwr00e <br>13886 root      20   0 2172892  96488   9056 R 41,3  0,1 121:26.97 glfs_iotwr002<br>13878 root      20   0 2172892 
 96488   9056 S  1,0  0,1   1:20.34 glfs_rpcrqhnd <br>13840 root      20   0 2172892  96488   9056 S  0,7  0,1   0:51.54 glfs_epoll000 <br>13841 root      20   0 2172892  96488   9056 S  0,7  0,1   0:51.14 glfs_epoll001 <br>13877 root      20   0 2172892  96488   9056 S  0,3  0,1   1:20.02 glfs_rpcrqhnd <br>13833 root      20   0 2172892  96488   9056 S  0,0  0,1   0:00.00 glusterfsd    <br>13834 root      20   0 2172892  96488   9056 S  0,0  0,1   0:00.14 glfs_timer    <br>13835 root      20   0 2172892  96488   9056 S  0,0  0,1   0:00.00 glfs_sigwait  <br>13836 root      20   0 2172892  96488   9056 S  0,0  0,1   0:00.16 glfs_memsweep <br>13837 root      20   0 2172892  96488   9056 S  0,0  0,1   0:00.05 glfs_sproc0      </font><br></div><div><br></div><div>I also didn&#39;t find any relevant messages in the log files.</div><div>Honestly, I don&#39;t know what to do. Does anyone know how to debug or fix this behaviour?</div><div><br></div><div>Best regards,</div><div>Pavel</div><div><br></div><div><br></div></div>