[Gluster-users] Disperse Volume & io-cache

Sherwin C Amiran sherwin at amiran.us
Tue May 12 19:49:28 UTC 2015


Hello,

I am using a GlusterFS disperse volume to host QEMU images. 

Previously, I had used a distributed-replicate volume, but the disperse volume seems like it would be a better fit for us.

I have created an 11-brick disperse volume (8 data + 3 redundancy).
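
For reference, the creation and mount commands looked roughly like the following. This is only a sketch: the "nodeN" hostnames, the way the bricks are spread across the servers, and the /mnt/oort mount point are placeholders (the real hosts are masked as XXXXX in the volume info below).

  # Sketch only: hostnames, brick distribution and mount point are placeholders.
  gluster volume create oort disperse 11 redundancy 3 \
      node1:/export/oort-brick-1/brick  node1:/export/oort-brick-2/brick \
      node1:/export/oort-brick-3/brick  node2:/export/oort-brick-4/brick \
      node2:/export/oort-brick-5/brick  node2:/export/oort-brick-6/brick \
      node3:/export/oort-brick-7/brick  node3:/export/oort-brick-8/brick \
      node3:/export/oort-brick-9/brick  node4:/export/oort-brick-10/brick \
      node4:/export/oort-brick-11/brick
  # (gluster may warn that several bricks share a server and ask to continue)
  gluster volume start oort

  # FUSE mount used on the QEMU hosts (placeholder path)
  mount -t glusterfs node1:/oort /mnt/oort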

During testing, we’ve encountered ongoing problems. The main one is what appear to be severe glusterfs hangs; the kernel logs many messages like the following:
INFO: task glusterfs:4359 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-37-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D ffff8803130d8040     0  4359      1    0 0x00000000
 ffff88031311db98 0000000000000086 0000000000000000 ffffffff00000000
 ffff88033fc0ad00 ffff8803130d8040 ffff88033e6eaf80 ffff88002820ffb0
 00016cebb4c94040 0000000000000006 0000000117e5c6b9 000000000000091c
Call Trace:
 [<ffffffff81139590>] ? sync_page+0x0/0x50
 [<ffffffff815616c3>] io_schedule+0x73/0xc0
 [<ffffffff811395cb>] sync_page+0x3b/0x50
 [<ffffffff8156249b>] __wait_on_bit_lock+0x5b/0xc0
 [<ffffffff81139567>] __lock_page+0x67/0x70
 [<ffffffff810a6910>] ? wake_bit_function+0x0/0x50
 [<ffffffff8115323b>] invalidate_inode_pages2_range+0x11b/0x380
 [<ffffffffa016da80>] ? fuse_inode_eq+0x0/0x20 [fuse]
 [<ffffffff811ccb54>] ? ifind+0x74/0xd0
 [<ffffffffa016fa10>] fuse_reverse_inval_inode+0x70/0xa0 [fuse]
 [<ffffffffa01629ae>] fuse_dev_do_write+0x50e/0x6d0 [fuse]
 [<ffffffff811ad81e>] ? do_sync_read+0xfe/0x140
 [<ffffffffa0162ed9>] fuse_dev_write+0x69/0x80 [fuse]
 [<ffffffff811ad6cc>] do_sync_write+0xec/0x140
 [<ffffffff811adf01>] vfs_write+0xa1/0x190
 [<ffffffff811ae25a>] sys_write+0x4a/0x90
 [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b

This plays havoc with the virtual machines.

In addition, read-write performance would bog down more quickly than expected, even under light load. The bricks are distributed among 4 servers connected by bonded gigabit Ethernet (LACP). For our application the slowdowns are not a major problem, but they are an irritation.

I have been trying different combinations of volume options to address this, and happened to find one that seems to have resolved both issues. On a whim, I disabled performance.io-cache, and client access to the volume now seems to be close to wire speed, at least for large-file reads and writes.
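
Concretely, the change and the rough check look like the following; /mnt/oort is a placeholder for our mount point, and a single large dd is obviously not a rigorous benchmark, just a sanity check.

  # Disable the io-cache translator on the volume
  gluster volume set oort performance.io-cache off

  # Crude large-file throughput check over the FUSE mount (placeholder path)
  dd if=/dev/zero of=/mnt/oort/ddtest bs=1M count=4096 conv=fsync
  echo 3 > /proc/sys/vm/drop_caches   # so the read hits gluster, not the page cache
  dd if=/mnt/oort/ddtest of=/dev/null bs=1M
  rm /mnt/oort/ddtest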

Reading the documentation, it seems that performance.io-cache would not be of much benefit to our workload anyway, but it seems strange that it would cause all of the problems we have been having. Is this expected behavior for disperse volumes? We had planned to transition another volume to the disperse configuration, and I’d like to have a good handle on which options are good or bad for it.

BTW: The options selected are really just based upon trial and error, with some not-very-rigorous testing.

My volume info is below:
Volume Name: oort
Type: Disperse
Volume ID: 9b8702b2-3901-4cdf-b839-b17a06017f66
Status: Started
Number of Bricks: 1 x (8 + 3) = 11
Transport-type: tcp
Bricks:
Brick1: XXXXX:/export/oort-brick-1/brick
Brick2: XXXXX:/export/oort-brick-2/brick
Brick3: XXXXX:/export/oort-brick-3/brick
Brick4: XXXXX:/export/oort-brick-4/brick
Brick5: XXXXX:/export/oort-brick-5/brick
Brick6: XXXXX:/export/oort-brick-6/brick
Brick7: XXXXX:/export/oort-brick-7/brick
Brick8: XXXXX:/export/oort-brick-8/brick
Brick9: XXXXX:/export/oort-brick-9/brick
Brick10: XXXXX:/export/oort-brick-10/brick
Brick11: XXXXX:/export/oort-brick-11/brick
Options Reconfigured:
transport.keepalive: on
server.allow-insecure: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: on
cluster.eager-lock: on
cluster.readdir-optimize: on
features.lock-heal: on
performance.stat-prefetch: on
performance.cache-size: 128MB
performance.io-thread-count: 64
performance.read-ahead: off
performance.write-behind: on
performance.io-cache: off
performance.quick-read: off
performance.flush-behind: on
performance.write-behind-window-size: 2MB
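
The trial-and-error testing was mostly flipping one option at a time, re-running the I/O test, and keeping or reverting the change; performance.read-ahead below is just an example option, not a recommendation.

  # Change a single option and confirm it took
  gluster volume set oort performance.read-ahead off
  gluster volume info oort   # changed options appear under "Options Reconfigured"
  # ... re-run the dd test / VM workload ...

  # Put the option back to its default if it made no difference
  gluster volume reset oort performance.read-ahead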

Thank you for any help you can provide.

Regards,

Sherwin