[Gluster-users] gluster 4.1.6 brick problems: 2 processes for one brick, performance problems
Hu Bert
revirii at googlemail.com
Wed Dec 12 09:41:06 UTC 2018
Hello,
We started with a gluster installation, version 3.12.11: 3 servers (gluster11,
gluster12, gluster13) with 4 bricks per server (each hdd == one brick, JBOD
behind the controller): bricksda1, bricksdb1, bricksdc1, bricksdd1.
Full information here: https://pastebin.com/0ndDSstG
In the beginning everything was running fine. In May one hdd
(sdd on gluster13) died and got replaced; I replaced the brick and the
self-heal started, taking weeks and worsening performance. One week
after the heal had finished, another hdd (sdd on gluster12) died -> same
procedure again; it again took weeks, bad performance etc.
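(For reference, heal progress can be watched per brick with something like
this; 'shared' is the volume name:)
gluster volume heal shared info                    # entries still pending heal, per brick
gluster volume heal shared statistics heal-count   # just the pending counts per brick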
After the replace/heal the performance on most of the bricks was
ok, but 2 of them perform badly; in short:
gluster11: no hdd change, bricksd(a|b|c) ok, bricksdd takes much
longer for requests
gluster12: 1 hdd change, all bricks with normal performance
gluster13: 1 hdd change, bricksd(a|b|c) ok, bricksdd takes much longer
for requests
We've checked (thx to Pranith and Xavi) hardware, disk speed, gluster
settings etc., but only the 2 bricksdd on gluster11+13 take much
longer (>2x) for each request, worsening the overall gluster
performance. So something must be wrong, especially with bricksdd1.
Does anyone know how to investigate this further?
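(As a sketch of how the per-brick latencies can be compared: gluster's io
profiling of the volume 'shared' under normal client load:)
gluster volume profile shared start
# let the volume run under normal client load for a while, then:
gluster volume profile shared info    # cumulative per-brick fop latency statistics
gluster volume profile shared stop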
2nd problem: during all these checks and searches we upgraded
glusterfs from 3.12.11 -> 3.12.15 and finally to 4.1.6, but the
problems didn't disappear. Well, and some additional problems came up:
this week I rebooted gluster11 and gluster13 (kernel updates; the
ones with the "sick" bricksdd1), and for these 2 bricks 2 processes
each got started, making the brick unavailable.
root 2118 0.1 0.0 944596 12452 ? Ssl 07:25 0:00
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l
/var/log/glusterfs/glustershd.log -S
/var/run/gluster/546621eb24596f4c.socket --xlator-option
*replicate*.node-uuid=4fdb11c3-a5af-4e18-af48-182c00b88cc8
--process-name glustershd
root 2197 0.5 0.0 540808 8672 ? Ssl 07:25 0:00
/usr/sbin/glusterfsd -s gluster13 --volfile-id
shared.gluster13.gluster-bricksdd1_new-shared -p
/var/run/gluster/vols/shared/gluster13-gluster-bricksdd1_new-shared.pid
-S /var/run/gluster/23f68b171e2f2c9e.socket --brick-name
/gluster/bricksdd1_new/shared -l
/var/log/glusterfs/bricks/gluster-bricksdd1_new-shared.log
--xlator-option
*-posix.glusterd-uuid=4fdb11c3-a5af-4e18-af48-182c00b88cc8
--process-name brick --brick-port 49155 --xlator-option
shared-server.listen-port=49155
In the brick log for bricksdd1_new I see:
[2018-12-12 06:20:41.817978] I [rpcsvc.c:2052:rpcsvc_spawn_threads]
0-rpc-service: spawned 1 threads for program 'GlusterFS 3.3'; total
count:1
[2018-12-12 06:20:41.818048] I [rpcsvc.c:2052:rpcsvc_spawn_threads]
0-rpc-service: spawned 1 threads for program 'GlusterFS 4.x v1'; total
count:1
A simple 'gluster volume start shared force' ended up with 4
processes for that brick. I had to do the following twice (see the
command sketch below):
- kill the 2 brick processes
- gluster volume start shared force
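(Roughly, with the PID taken from the ps output above; the second PID is
just a placeholder:)
pgrep -af glusterfsd | grep bricksdd1_new   # find the glusterfsd processes for this brick
kill 2197 <second-pid>                      # kill both brick processes
gluster volume start shared force           # bring the brick back up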
After the 2nd try there was only 1 brick process left and the heal
started etc. Has anyone else seen 2 processes for one brick? I
followed the upgrade guide
(https://docs.gluster.org/en/latest/Upgrade-Guide/upgrade_to_4.1/),
but is there anything one can do?
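(To verify that only one brick process is left, something like this should do:)
gluster volume status shared                    # expects exactly one pid/port per brick
ps aux | grep '[g]lusterfsd.*bricksdd1_new'     # double-check on the node itself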
3rd problem: I've seen some additional issues: on the mounted
volume some clients can't see certain directories even though they
exist, while other clients can. Example:
client1: ls /data/repository/shared/public/staticmap/118/
ls: cannot access '/data/repository/shared/public/staticmap/118/408':
No such file or directory
238 255 272 289 306 323 340 357 374 391 408 478 [...]
client1: ls /data/repository/shared/public/staticmap/118/408/
ls: cannot access '/data/repository/shared/public/staticmap/118/408/':
No such file or directory
client2: ls /data/repository/shared/public/staticmap/118/408/
118408013 118408051 118408260 118408285 118408334 118408399 [...]
Mount options: nothing special; from /etc/fstab:
gluster13:/shared /shared glusterfs defaults,_netdev 0 0
After a umount/mount the problem disappears:
umount /data/repository/shared ; mount -t glusterfs gluster12:/shared
/data/repository/shared
Has anyone had or seen such problems?
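(In case it's relevant: one could also compare the directory's gfid directly
on the bricks; the brick-side path below is an assumption based on our
layout, and the trusted.gfid value should be identical on every brick:)
getfattr -d -m . -e hex /gluster/bricksdd1_new/shared/public/staticmap/118/408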
Thx
Hubert