[Gluster-devel] Shall we revert quota-anon-fd.t?
Vijay Bellur
vbellur at redhat.com
Wed Jun 11 08:01:04 UTC 2014
On 06/11/2014 10:45 AM, Pranith Kumar Karampuri wrote:
>
> On 06/11/2014 09:45 AM, Vijay Bellur wrote:
>> On 06/11/2014 08:21 AM, Pranith Kumar Karampuri wrote:
>>> hi,
>>> I see that quota-anon-fd.t is causing too many spurious failures. I
>>> think we should revert it and raise a bug so that it can be fixed and
>>> committed again along with the fix.
>>>
>>
>> I think we can do that. The problem here stems from the fact that nfs
>> can deadlock when the client and server run on the same node and system
>> memory utilization is high. We also need to look into other nfs tests to
>> determine if there are similar possibilities.
>
> I doubt it is because of that; there are so many nfs mount tests, yet
I have been following this problem closely on b.g.o. This backtrace does
indicate that dd is hung:
INFO: task dd:6039 blocked for more than 120 seconds.
Not tainted 2.6.32-431.3.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dd D ffff880028100840 0 6039 5704 0x00000080
ffff8801f843faa8 0000000000000286 ffff8801ffffffff 01eff88bb6f58e28
ffff8801db96bb80 ffff8801f8213590 00000000036c74dc ffffffffac6f4edf
ffff8801faf11af8 ffff8801f843ffd8 000000000000fbc8 ffff8801faf11af8
Call Trace:
[<ffffffff810a70b1>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff8111f940>] ? sync_page+0x0/0x50
[<ffffffff815280b3>] io_schedule+0x73/0xc0
[<ffffffff8111f97d>] sync_page+0x3d/0x50
[<ffffffff81528b7f>] __wait_on_bit+0x5f/0x90
[<ffffffff8111fbb3>] wait_on_page_bit+0x73/0x80
[<ffffffff8109b330>] ? wake_bit_function+0x0/0x50
[<ffffffff81135c05>] ? pagevec_lookup_tag+0x25/0x40
[<ffffffff8111ffdb>] wait_on_page_writeback_range+0xfb/0x190
[<ffffffff811201a8>] filemap_write_and_wait_range+0x78/0x90
[<ffffffff811baa4e>] vfs_fsync_range+0x7e/0x100
[<ffffffff811bab1b>] generic_write_sync+0x4b/0x50
[<ffffffff81122056>] generic_file_aio_write+0xe6/0x100
[<ffffffffa042f20e>] nfs_file_write+0xde/0x1f0 [nfs]
[<ffffffff81188c8a>] do_sync_write+0xfa/0x140
[<ffffffff8152a825>] ? page_fault+0x25/0x30
[<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8128ec6f>] ? __clear_user+0x3f/0x70
[<ffffffff8128ec51>] ? __clear_user+0x21/0x70
[<ffffffff812263d6>] ? security_file_permission+0x16/0x20
[<ffffffff81188f88>] vfs_write+0xb8/0x1a0
[<ffffffff81189881>] sys_write+0x51/0x90
[<ffffffff810e1e6e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
I have seen dd in uninterruptible sleep on b.g.o. The trace above shows dd
blocked in generic_write_sync() -> wait_on_page_writeback_range(), i.e.
waiting for NFS writeback that the gluster NFS server on the same node has
to service. There are also instances [1] where anon-fd-nfs has run for more
than 6000 seconds. This definitely points to the nfs deadlock.
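
One way to catch this on a slave while a run is wedged would be to scan
/proc for tasks stuck in uninterruptible sleep. A rough sketch (nothing to
do with our test harness; it only reads the standard /proc status fields):

#!/usr/bin/env python
# Rough sketch: list tasks in uninterruptible sleep ('D' state), the state
# dd was stuck in during the hung quota-anon-fd.t runs.
import os

for pid in filter(str.isdigit, os.listdir('/proc')):
    status = {}
    try:
        # /proc/<pid>/status has "Name:" and "State:" lines
        for line in open('/proc/%s/status' % pid):
            if ':' in line:
                key, value = line.split(':', 1)
                status[key] = value.strip()
    except IOError:
        continue        # task exited while we were scanning
    if status.get('State', '').startswith('D'):
        print('%s (pid %s) is in uninterruptible sleep'
              % (status.get('Name'), pid))

dd or a gluster process sitting there for minutes at a stretch would match
what the hung task warning above reports.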
> only
> this one keeps failing for the past 2-3 days.
It is a function of the system's memory consumption and what the oom killer
decides to kill. If the gluster NFS server or a glusterfsd process gets
killed, then the test unit will fail. If the test can hold on until the
system reclaims memory, it can possibly succeed.
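
To confirm this from a failed run, we could check the kernel log on the
slave for oom-killer victims. A rough sketch along these lines (the
"Out of memory: Kill process" message format and the /var/log/messages
path are assumptions based on the usual EL6 setup):

#!/usr/bin/env python
# Rough sketch: check whether the oom-killer picked off a gluster process
# during a regression run.  The message format and the log path are the
# usual EL6 ones -- adjust for whatever the build slaves actually use.
import re

LOG = '/var/log/messages'   # assumption: default syslog location on the slave
OOM = re.compile(r'Out of memory: Kill process (\d+) \(([^)]+)\)')

for line in open(LOG):
    m = OOM.search(line)
    if m and m.group(2).startswith('gluster'):
        print('oom-killer killed %s (pid %s) -- the test unit will fail'
              % (m.group(2), m.group(1)))

If a glusterfs/glusterfsd process shows up as a victim around the time
anon-fd-nfs fails, that would confirm this theory.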
However, there could be other possibilities, and we need to root-cause them
as well.
-Vijay
[1] http://build.gluster.org/job/regression/4783/console