[Gluster-users] Problem After adding Bricks
Scott Lovell
scottlovell at gmail.com
Fri May 24 18:19:10 UTC 2013
Hello, I have run into some performance issues after adding bricks to
a 3.3.1 volume. Basically I am seeing very high CPU usage and
extremely degraded performance. I started a rebalance but stopped it
after a couple of days. The logs have a lot of split-brain entries as
well as "Non Blocking entrylks failed for" messages.
For some of the directories, an ls on the client shows multiple
entries for the same directory (ls output below). I am wondering if it
is just spinning trying to heal itself? I have been able to fix some
of these entries by removing gfid files, stat-ing the paths, etc.
(roughly as sketched below), however I feel I may just be making
matters worse.
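By "removing gfid files" I mean something along these lines, done on
whichever brick holds the copy I decide to discard (paths are
illustrative; as far as I understand, the .glusterfs entry is keyed by
the first two and next two characters of the gfid, and is a symlink
for a directory, a hardlink for a regular file):

# read the gfid of the copy to discard, directly on the brick
getfattr -n trusted.gfid -e hex /bricks/b01/ftp_scan/199268/mirror
# remove its .glusterfs entry (ab/cd = first two and next two characters of the gfid)
rm /bricks/b01/.glusterfs/ab/cd/<full-gfid>
# remove the bad copy itself from the brick
rm -rf /bricks/b01/ftp_scan/199268/mirror
# trigger self-heal by looking the path up through the fuse mount
stat /mnt/glusterfs/ftp_scan/199268/mirror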
So far the most permanent fix has been to rsync the files out of the
bricks, remove the directories, and copy everything back in through
the normal fuse mount (roughly as sketched below), but that will take
quite some time given the performance problems.
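Per affected directory, that workflow is roughly the following (the
staging path is just an example):

# pull a good copy off one of the bricks into a staging area
rsync -a fs01:/bricks/b01/ftp_scan/199268/ /staging/199268/
# remove the directory (and its .glusterfs entries) from every brick it
# appears on, then copy it back in through the fuse mount so the volume
# lays it out fresh
cp -a /staging/199268 /mnt/glusterfs/ftp_scan/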
Has anyone seen this behavior before, or does anyone know of any possible fixes?
Thanks in advance,
Scott
root at ftpscan02:~# ls -lrt /mnt/glusterfs/ftp_scan/199268/
ls: cannot access /mnt/glusterfs/ftp_scan/199268/mirror: Input/output error
ls: cannot access /mnt/glusterfs/ftp_scan/199268/mirror: Input/output error
ls: cannot access /mnt/glusterfs/ftp_scan/199268/mirror: Input/output error
ls: cannot access /mnt/glusterfs/ftp_scan/199268/mirror: Input/output error
ls: cannot access /mnt/glusterfs/ftp_scan/199268/mirror_trash: No data available
total 765
?????????? ? ? ? ? ? mirror_trash
?????????? ? ? ? ? ? mirror
?????????? ? ? ? ? ? mirror
?????????? ? ? ? ? ? mirror
?????????? ? ? ? ? ? mirror
-rw------- 1 torque torque 90287 May 18 20:47 cache
-rw------- 1 torque torque 667180 May 18 23:35 file_mapping
drwx------ 3 torque torque 8192 May 23 11:31 mirror_trash
drwx------ 3 torque torque 8192 May 23 11:31 mirror_trash
drwx------ 3 torque torque 8192 May 23 11:31 mirror_trash
Volume Name: gv01
Type: Distributed-Replicate
Volume ID: 03cf79bd-c5d8-467d-9f31-6c3c40dd94e2
Status: Started
Number of Bricks: 11 x 2 = 22
Transport-type: tcp
Bricks:
Brick1: fs01:/bricks/b01
Brick2: fs02:/bricks/b01
Brick3: fs01:/bricks/b02
Brick4: fs02:/bricks/b02
Brick5: fs01:/bricks/b03
Brick6: fs02:/bricks/b03
Brick7: fs01:/bricks/b04
Brick8: fs02:/bricks/b04
Brick9: fs01:/bricks/b05
Brick10: fs02:/bricks/b05
Brick11: fs01:/bricks/b06
Brick12: fs02:/bricks/b06
Brick13: fs01:/bricks/b07
Brick14: fs02:/bricks/b07
Brick15: fs01:/bricks/b08
Brick16: fs02:/bricks/b08
Brick17: fs01:/bricks/b09
Brick18: fs02:/bricks/b09
Brick19: fs01:/bricks/b10
Brick20: fs02:/bricks/b10
Brick21: fs01:/bricks/b11
Brick22: fs02:/bricks/b11
Options Reconfigured:
performance.quick-read: off
performance.io-cache: off
performance.stat-prefetch: off
performance.write-behind: off
performance.write-behind-window-size: 1MB
performance.flush-behind: off
nfs.disable: off
performance.cache-size: 16MB
performance.io-thread-count: 8
performance.cache-refresh-timeout: 10
diagnostics.client-log-level: ERROR
performance.read-ahead: on
cluster.data-self-heal: on
nfs.register-with-portmap: on
Top output:
top - 13:00:36 up 1 day, 16:02, 3 users, load average: 38.04, 37.83, 37.68
Tasks: 183 total, 2 running, 181 sleeping, 0 stopped, 0 zombie
Cpu0 : 35.8%us, 49.2%sy, 0.0%ni, 0.0%id, 5.4%wa, 0.0%hi, 9.7%si, 0.0%st
Cpu1 : 32.8%us, 59.5%sy, 0.0%ni, 1.7%id, 6.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 37.2%us, 55.1%sy, 0.0%ni, 1.7%id, 5.6%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu3 : 36.5%us, 56.4%sy, 0.0%ni, 2.0%id, 5.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 36.8%us, 54.2%sy, 0.0%ni, 1.0%id, 8.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 36.9%us, 53.0%sy, 0.0%ni, 2.0%id, 8.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 38.6%us, 54.0%sy, 0.0%ni, 1.3%id, 5.7%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu7 : 34.3%us, 59.3%sy, 0.0%ni, 2.4%id, 4.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 36.5%us, 55.7%sy, 0.0%ni, 3.4%id, 4.4%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 34.7%us, 59.2%sy, 0.0%ni, 0.7%id, 5.4%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 34.9%us, 58.2%sy, 0.0%ni, 1.7%id, 5.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 38.4%us, 55.0%sy, 0.0%ni, 1.0%id, 5.6%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16405712k total, 16310088k used, 95624k free, 12540824k buffers
Swap: 1999868k total, 9928k used, 1989940k free, 656604k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2460 root 20 0 391m 38m 1616 S 250 0.2 4160:51 glusterfsd
2436 root 20 0 392m 40m 1624 S 243 0.3 4280:26 glusterfsd
2442 root 20 0 391m 39m 1620 S 187 0.2 3933:46 glusterfsd
2454 root 20 0 391m 36m 1620 S 118 0.2 3870:23 glusterfsd
2448 root 20 0 391m 38m 1624 S 110 0.2 3720:50 glusterfsd
2472 root 20 0 393m 42m 1624 S 105 0.3 319:25.80 glusterfsd
2466 root 20 0 391m 37m 1556 R 51 0.2 3407:37 glusterfsd
2484 root 20 0 392m 40m 1560 S 10 0.3 268:51.71 glusterfsd
2490 root 20 0 392m 41m 1616 S 10 0.3 252:44.63 glusterfsd
2496 root 20 0 392m 41m 1544 S 10 0.3 262:26.80 glusterfsd
2478 root 20 0 392m 41m 1536 S 9 0.3 219:15.17 glusterfsd
2508 root 20 0 585m 365m 1364 S 5 2.3 46:20.31 glusterfs
3081 root 20 0 407m 101m 1676 S 2 0.6 239:36.76 glusterfs