gluster 4.1.6 brick problems: 2 processes for one brick, performance problems

Hu Bert revirii at googlemail.com
Wed Dec 12 09:41:06 UTC 2018


We started with a gluster 3.12.11 installation: 3 servers (gluster11,
gluster12, gluster13) with 4 bricks per server (each hdd == one brick,
JBOD behind the controller): bricksda1, bricksdb1, bricksdc1, bricksdd1;
full information is here: https://pastebin.com/0ndDSstG

In the beginning everything ran fine. In May one hdd (sdd on
gluster13) died and was replaced; I replaced the brick and the
self-heal started, taking weeks and degrading performance. One week
after the heal had finished another hdd (sdd on gluster12) died, so I
did the same again; it again took weeks, with bad performance etc.

After the replace/heal the performance of most bricks was ok, but 2
bricks perform badly; in short:

gluster11: no hdd change, bricksd(a|b|c) ok, bricksdd takes much
longer for requests
gluster12: 1 hdd change, all bricks with normal performance
gluster13: 1 hdd change, bricksd(a|b|c) ok, bricksdd takes much longer
for requests

We've checked (thanks to Pranith and Xavi) hardware, disk speed, gluster
settings etc., but only the 2 bricksdd on gluster11+13 take much
longer (>2x) for each request, degrading the overall gluster
performance. So something must be wrong, especially with bricksdd1.
Does anyone know how to investigate this?
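
One thing that might help narrow this down is gluster's built-in
profiling ('gluster volume profile <volume> start|info|stop'), which
reports per-brick latency for each file operation. A minimal sketch,
assuming the volume name 'shared' from above; the wrapper function
profile_volume and the 60 second default are made up for illustration
and are not part of gluster:

```shell
# Hypothetical helper: profile a volume for a fixed time window.
# Only the 'gluster volume profile' subcommands are real gluster CLI;
# the function name and the default duration are assumptions.
profile_volume() {
    vol=$1
    secs=${2:-60}
    # Start collecting per-brick, per-FOP statistics.
    gluster volume profile "$vol" start || return 1
    # Let normal client traffic run for a while.
    sleep "$secs"
    # Print the collected statistics for every brick of the volume.
    gluster volume profile "$vol" info
    # Stop profiling again so it does not keep running.
    gluster volume profile "$vol" stop
}

# Example (run on one of the servers):
#   profile_volume shared 120 > /tmp/shared-profile.txt
```

Comparing the latency figures of the bricksdd1 bricks on gluster11 and
gluster13 with those of the other bricks should at least show which
operations are slow there.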

2nd problem: during all these checks and searches we upgraded
glusterfs from 3.12.11 -> 3.12.15 and finally to 4.1.6, but the
problems didn't disappear. And some additional problems came up:
this week I rebooted (kernel updates) gluster11 and gluster13 (the
ones with the "sick" bricksdd1), and for each of these 2 bricks 2
processes were started, making the brick unavailable.

root      2118  0.1  0.0 944596 12452 ?        Ssl  07:25   0:00
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l
/var/log/glusterfs/glustershd.log -S
/var/run/gluster/546621eb24596f4c.socket --xlator-option
--process-name glustershd
root      2197  0.5  0.0 540808  8672 ?        Ssl  07:25   0:00
/usr/sbin/glusterfsd -s gluster13 --volfile-id
shared.gluster13.gluster-bricksdd1_new-shared -p
-S /var/run/gluster/23f68b171e2f2c9e.socket --brick-name
/gluster/bricksdd1_new/shared -l
--process-name brick --brick-port 49155 --xlator-option

In the brick log for bricksdd1_new I see:

[2018-12-12 06:20:41.817978] I [rpcsvc.c:2052:rpcsvc_spawn_threads]
0-rpc-service: spawned 1 threads for program 'GlusterFS 3.3'; total
[2018-12-12 06:20:41.818048] I [rpcsvc.c:2052:rpcsvc_spawn_threads]
0-rpc-service: spawned 1 threads for program 'GlusterFS 4.x v1'; total

A simple 'gluster volume start shared force' resulted in 4 processes
for that brick. I had to do the following twice:

- kill the 2 brick processes
- gluster volume start shared force

After the 2nd try there was only 1 brick process left, and the heal
started etc. Has anyone else seen 2 processes for one brick? I
followed the upgrade guide,
but is there anything one can do?
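
To spot duplicate brick processes quickly after a reboot, something
like the following might be useful. It is only a sketch: it assumes
that every brick process is a glusterfsd whose command line contains
'--brick-name <path>' (as in the ps output above), and the function
name count_brick_procs is made up:

```shell
# Hypothetical helper: read command lines (one per line, command first,
# e.g. from 'ps -eo args') on stdin and print every brick path that is
# served by more than one glusterfsd process, prefixed by the count.
count_brick_procs() {
    awk '
        # Only consider lines whose command is glusterfsd.
        $1 ~ /(^|\/)glusterfsd$/ {
            for (i = 1; i < NF; i++)
                if ($i == "--brick-name") count[$(i + 1)]++
        }
        END {
            for (b in count)
                if (count[b] > 1) print count[b], b
        }
    '
}

# Example (run on each server):
#   ps -eo args | count_brick_procs
```

No output means that each brick has at most one process.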

3rd problem: I've seen an additional issue: on the mounted volume some
clients can't see some directories, even though they exist, while
other clients can. Example:

client1: ls /data/repository/shared/public/staticmap/118/
ls: cannot access '/data/repository/shared/public/staticmap/118/408':
No such file or directory
238  255  272  289  306  323  340  357  374  391  408 478 [...]

client1: ls /data/repository/shared/public/staticmap/118/408/
ls: cannot access '/data/repository/shared/public/staticmap/118/408/':
No such file or directory

client2: ls /data/repository/shared/public/staticmap/118/408/
118408013  118408051  118408260  118408285  118408334  118408399 [...]
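
To find all entries that are listed but cannot be accessed on a given
client, a small check along these lines might help. It is a sketch
only; the function name report_unreachable is made up, and it assumes
file names without newlines:

```shell
# Hypothetical helper: read entry names (one per line) on stdin and
# print those that cannot be looked up inside the given directory.
report_unreachable() {
    dir=$1
    while IFS= read -r name; do
        # Listing a name only needs the directory to be readable; the
        # -e / -L tests need a successful lookup of the entry itself.
        if [ ! -e "$dir/$name" ] && [ ! -L "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
}

# Example (run on the affected client):
#   d=/data/repository/shared/public/staticmap/118
#   ls -A "$d" 2>/dev/null | report_unreachable "$d"
```

Running the same check on an unaffected client should print nothing
for the same directory.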

mount options: nothing special. from /etc/fstab:

gluster13:/shared /shared glusterfs defaults,_netdev 0 0

A umount/mount makes the problem disappear:
umount /data/repository/shared ; mount -t glusterfs gluster12:/shared

Has anyone had or seen such problems?


