[Gluster-users] NFS blocked for more than 120 seconds on gluster 3.7.12

Mon Jul 11 16:13:23 UTC 2016

Hi Folks,

Last week I upgraded Gluster from 3.7.11 to 3.7.12. Since then I have twice had the problem
of a task being blocked for more than 120 seconds, leaving me no option but a hard reset of the node!

kernel: [1705322.676270] INFO: task apache2:3092 blocked for more than 120 seconds.
kernel: [1705322.682202]       Not tainted 3.13.0-88-generic #135-Ubuntu
kernel: [1705322.682770] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: [1705322.683738] apache2         D ffff880a03d93180     0  3092   1800 0x00000000
kernel: [1705322.683761]  ffff88016f865b38 0000000000000082 ffff8809cbde8000 0000000000013180
kernel: [1705322.683769]  ffff88016f865fd8 0000000000013180 ffff8809cbde8000 ffff880a03d93a18
kernel: [1705322.683775]  ffff880a03f92d90 0000000000000002 ffffffffa029e670 ffff88016f865bb0
kernel: [1705322.683782] Call Trace:
kernel: [1705322.683978]  [<ffffffffa029e670>] ? nfs_free_request+0xb0/0xb0 [nfs]
kernel: [1705322.684043]  [<ffffffff8172e10d>] io_schedule+0x9d/0x130
kernel: [1705322.684096]  [<ffffffffa029e67e>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
kernel: [1705322.684103]  [<ffffffff8172e582>] __wait_on_bit+0x62/0x90
kernel: [1705322.684162]  [<ffffffff81617018>] ? sk_reset_timer+0x18/0x30
kernel: [1705322.684193]  [<ffffffffa029e670>] ? nfs_free_request+0xb0/0xb0 [nfs]
kernel: [1705322.684203]  [<ffffffff8172e627>] out_of_line_wait_on_bit+0x77/0x90
kernel: [1705322.684234]  [<ffffffff810adac0>] ? autoremove_wake_function+0x40/0x40
kernel: [1705322.684258]  [<ffffffffa029ea33>] nfs_wait_on_request+0x33/0x40 [nfs]
kernel: [1705322.684276]  [<ffffffffa02a3a40>] nfs_updatepage+0x150/0x660 [nfs]
kernel: [1705322.684290]  [<ffffffffa0294dfb>] nfs_write_end+0x5b/0x340 [nfs]
kernel: [1705322.684318]  [<ffffffff811522da>] generic_file_buffered_write+0x16a/0x250
kernel: [1705322.684336]  [<ffffffff81153991>] __generic_file_aio_write+0x1c1/0x3d0
kernel: [1705322.684340]  [<ffffffff81153bf8>] generic_file_aio_write+0x58/0xa0
kernel: [1705322.684354]  [<ffffffffa029406b>] nfs_file_write+0xbb/0x1d0 [nfs]
kernel: [1705322.684387]  [<ffffffff811c096a>] do_sync_write+0x5a/0x90
kernel: [1705322.684394]  [<ffffffff811c10f4>] vfs_write+0xb4/0x1f0
kernel: [1705322.684399]  [<ffffffff811c1b29>] SyS_write+0x49/0xa0
kernel: [1705322.684411]  [<ffffffff8173a4dd>] system_call_fastpath+0x1a/0x1f

Some research turned up these links:
* https://www.gluster.org/pipermail/gluster-users/2016-July/027327.html
* http://serverfault.com/questions/500222/kernel-3-8-apache2-with-wsgi-info-task-apache2-blocked-for-more-than-120-sec
* https://www.novell.com/support/kb/doc.php?id=7010287l

The Gluster volume serves a home directory for apache/php-fpm, and the server is usually quite busy in terms of requests.
With 3.7.11 I did not have any problems over the previous few weeks, so I am unsure whether this is related to the 3.7.11 -> 3.7.12 upgrade
(or is it just the file system blocking?).

My setup is:
# dpkg -l | grep gluster
ii  glusterfs-client                     3.7.12-ubuntu1~trusty1                               amd64        clustered file-system (client package)
ii  glusterfs-common                     3.7.12-ubuntu1~trusty1                               amd64        GlusterFS common libraries and translator modules
ii  glusterfs-server                     3.7.12-ubuntu1~trusty1                               amd64        clustered file-system (server package)
The Gluster volume is mounted via NFS on the same host that serves it:
XXX:/gldata on /home/gldata type nfs (rw,nfsvers=3,addr=XXX,_netdev)

Are there any known race conditions with 3.7.12 and NFS?
Should I apply the vm.dirty_ratio settings mentioned in the links above?
Should I downgrade to 3.7.11 or upgrade to 3.7.13?
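
For reference, the current writeback thresholds can be checked before changing anything. This is only a minimal sketch: the function name show_dirty_settings and its optional directory argument are hypothetical, and vm.dirty_background_ratio is included on the assumption that it is the companion setting the linked articles tune alongside vm.dirty_ratio.

```shell
# Print the current page-cache writeback thresholds.
# The optional directory argument is a hypothetical convenience so the
# function can be pointed at test files; it defaults to /proc/sys/vm.
show_dirty_settings() {
    dir="${1:-/proc/sys/vm}"
    for name in dirty_ratio dirty_background_ratio; do
        if [ -r "$dir/$name" ]; then
            printf 'vm.%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            printf 'vm.%s = (not readable)\n' "$name"
        fi
    done
}

# Read the live values on this host (read-only, changes nothing).
show_dirty_settings
```

This only reads the values; whether lowering them would help with the blocked NFS writes is exactly the open question above.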

I appreciate any help,
THX Georg
