[Gluster-devel] Shall we revert quota-anon-fd.t?
Niels de Vos
ndevos at redhat.com
Wed Jun 11 10:58:46 UTC 2014
On Wed, Jun 11, 2014 at 01:31:04PM +0530, Vijay Bellur wrote:
> On 06/11/2014 10:45 AM, Pranith Kumar Karampuri wrote:
> >
> >On 06/11/2014 09:45 AM, Vijay Bellur wrote:
> >>On 06/11/2014 08:21 AM, Pranith Kumar Karampuri wrote:
> >>>hi,
> >>> I see that quota-anon-fd.t is causing too many spurious failures. I
> >>>think we should revert it and raise a bug so that it can be fixed and
> >>>committed again along with the fix.
> >>>
> >>
> >>I think we can do that. The problem here stems from the fact that
> >>nfs can deadlock when the client and server are on the same node and
> >>system memory utilization is high. We also need to look into other
> >>nfs tests to determine if there are similar possibilities.
> >
> >I doubt it is because of that; there are so many nfs mount tests,
>
> I have been following this problem closely on b.g.o. This backtrace
> does indicate dd being hung:
>
> INFO: task dd:6039 blocked for more than 120 seconds.
> Not tainted 2.6.32-431.3.1.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> dd D ffff880028100840 0 6039 5704 0x00000080
> ffff8801f843faa8 0000000000000286 ffff8801ffffffff 01eff88bb6f58e28
> ffff8801db96bb80 ffff8801f8213590 00000000036c74dc ffffffffac6f4edf
> ffff8801faf11af8 ffff8801f843ffd8 000000000000fbc8 ffff8801faf11af8
> Call Trace:
> [<ffffffff810a70b1>] ? ktime_get_ts+0xb1/0xf0
> [<ffffffff8111f940>] ? sync_page+0x0/0x50
> [<ffffffff815280b3>] io_schedule+0x73/0xc0
> [<ffffffff8111f97d>] sync_page+0x3d/0x50
> [<ffffffff81528b7f>] __wait_on_bit+0x5f/0x90
> [<ffffffff8111fbb3>] wait_on_page_bit+0x73/0x80
> [<ffffffff8109b330>] ? wake_bit_function+0x0/0x50
> [<ffffffff81135c05>] ? pagevec_lookup_tag+0x25/0x40
> [<ffffffff8111ffdb>] wait_on_page_writeback_range+0xfb/0x190
> [<ffffffff811201a8>] filemap_write_and_wait_range+0x78/0x90
> [<ffffffff811baa4e>] vfs_fsync_range+0x7e/0x100
> [<ffffffff811bab1b>] generic_write_sync+0x4b/0x50
> [<ffffffff81122056>] generic_file_aio_write+0xe6/0x100
> [<ffffffffa042f20e>] nfs_file_write+0xde/0x1f0 [nfs]
> [<ffffffff81188c8a>] do_sync_write+0xfa/0x140
> [<ffffffff8152a825>] ? page_fault+0x25/0x30
> [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffff8128ec6f>] ? __clear_user+0x3f/0x70
> [<ffffffff8128ec51>] ? __clear_user+0x21/0x70
> [<ffffffff812263d6>] ? security_file_permission+0x16/0x20
> [<ffffffff81188f88>] vfs_write+0xb8/0x1a0
> [<ffffffff81189881>] sys_write+0x51/0x90
> [<ffffffff810e1e6e>] ? __audit_syscall_exit+0x25e/0x290
> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
>
> I have seen dd in uninterruptible sleep on b.g.o. There are
> also instances [1] where anon-fd-nfs has run for 6000+
> seconds. This definitely points to the nfs deadlock.
[1] is a run where nfs.drc is still enabled. I'd like to know if you
have seen other, more recent runs that include
http://review.gluster.org/8004 (which disables nfs.drc by default).
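For runs that do not have that change yet, the option can also be
turned off by hand on the test volume, roughly like this (the volume
name "patchy" is just what the regression tests typically use, adjust
as needed):

    # disable the NFS duplicate-request-cache for the test volume
    gluster volume set patchy nfs.drc off
    # check that it shows up under "Options Reconfigured"
    gluster volume info patchy | grep nfs.drc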
Are there backtraces from the same time where alloc_pages() and/or
try_to_free_pages() are listed? The blocking of the writer (here: dd)
likely depends on memory allocations needed on the receiving end
(here: the nfs-server). This is a relatively common issue for the Linux
kernel NFS server when loopback-mounts are used under memory pressure.
A nice description and proposed solution of this has recently been
posted to LWN.net:
- http://lwn.net/Articles/595652/
This solution is on the client side (the NFS client in the Linux
kernel), and from a quick cursory look it should help prevent these
issues for Gluster's NFS server too. But I don't think the patches have
been merged yet.
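In the meantime, if someone catches the hang again on b.g.o, it would
help to capture the kernel stacks of all blocked tasks at that moment.
Nothing Gluster specific, standard sysrq/procfs should do:

    # assuming sysrq is enabled (kernel.sysrq = 1):
    # dump the stacks of all uninterruptible (D state) tasks into dmesg
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200
    # or look at a single stuck process, e.g. the hanging dd
    cat /proc/$(pidof dd)/stack

If alloc_pages()/try_to_free_pages() show up in the NFS server process
or in kswapd at the same time, that would support the memory pressure
theory.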
> >only
> >this one has been failing for the past 2-3 days.
>
> It is a function of the system memory consumption and what the oom
> killer decides to kill. If NFS or a glusterfsd process gets killed,
> then the test unit will fail. If the test can continue until the
> system reclaims memory, it can possibly succeed.
>
> However, there could be other possibilities and we need to root
> cause them as well.
Yes, I agree. It would help if there were a known way to trigger the OOM
so that the investigation can be done on a system other than
build.gluster.org. Does anyone know of steps that reliably reproduce
this kind of issue?
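For what it's worth, something along these lines is what I would try
first to force the loopback-mount + memory pressure combination. This
is only a rough, untested sketch; the volume, brick and mount paths are
just placeholders:

    # create a small volume and mount it over the Gluster NFS server
    # on the same host (loopback mount, like the regression test setup)
    gluster volume create patchy $(hostname):/d/bricks/patchy force
    gluster volume start patchy
    mkdir -p /mnt/nfs
    mount -t nfs -o vers=3,nolock $(hostname):/patchy /mnt/nfs
    # keep memory tight (a VM with very little RAM, or a memory hog
    # running alongside) and push a large synced write through the mount
    dd if=/dev/zero of=/mnt/nfs/bigfile bs=1M count=4096 conv=fsync

If that alone does not trigger it, artificially increasing memory
pressure while the dd is running might.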
Thanks,
Niels
>
>
> -Vijay
>
> [1] http://build.gluster.org/job/regression/4783/console
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel