[Gluster-devel] Shall we revert quota-anon-fd.t?

Vijay Bellur vbellur at redhat.com
Wed Jun 11 08:01:04 UTC 2014


On 06/11/2014 10:45 AM, Pranith Kumar Karampuri wrote:
>
> On 06/11/2014 09:45 AM, Vijay Bellur wrote:
>> On 06/11/2014 08:21 AM, Pranith Kumar Karampuri wrote:
>>> hi,
>>>     I see that quota-anon-fd.t is causing too many spurious failures. I
>>> think we should revert it and raise a bug so that it can be fixed and
>>> committed again along with the fix.
>>>
>>
>> I think we can do that. The problem here is stemming from the issue
>> that nfs can deadlock when we have client and servers on the same node
>> with system memory utilization being on the higher side. We also need
>> to look into other nfs tests to determine if there are similar
>> possibilities.
>
> I doubt it is because of that; there are so many nfs mount tests,

I have been following this problem closely on b.g.o. This backtrace does 
indicate that dd is hung:

INFO: task dd:6039 blocked for more than 120 seconds.
       Not tainted 2.6.32-431.3.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dd            D ffff880028100840     0  6039   5704 0x00000080
  ffff8801f843faa8 0000000000000286 ffff8801ffffffff 01eff88bb6f58e28
  ffff8801db96bb80 ffff8801f8213590 00000000036c74dc ffffffffac6f4edf
  ffff8801faf11af8 ffff8801f843ffd8 000000000000fbc8 ffff8801faf11af8
Call Trace:
  [<ffffffff810a70b1>] ? ktime_get_ts+0xb1/0xf0
  [<ffffffff8111f940>] ? sync_page+0x0/0x50
  [<ffffffff815280b3>] io_schedule+0x73/0xc0
  [<ffffffff8111f97d>] sync_page+0x3d/0x50
  [<ffffffff81528b7f>] __wait_on_bit+0x5f/0x90
  [<ffffffff8111fbb3>] wait_on_page_bit+0x73/0x80
  [<ffffffff8109b330>] ? wake_bit_function+0x0/0x50
  [<ffffffff81135c05>] ? pagevec_lookup_tag+0x25/0x40
  [<ffffffff8111ffdb>] wait_on_page_writeback_range+0xfb/0x190
  [<ffffffff811201a8>] filemap_write_and_wait_range+0x78/0x90
  [<ffffffff811baa4e>] vfs_fsync_range+0x7e/0x100
  [<ffffffff811bab1b>] generic_write_sync+0x4b/0x50
  [<ffffffff81122056>] generic_file_aio_write+0xe6/0x100
  [<ffffffffa042f20e>] nfs_file_write+0xde/0x1f0 [nfs]
  [<ffffffff81188c8a>] do_sync_write+0xfa/0x140
  [<ffffffff8152a825>] ? page_fault+0x25/0x30
  [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
  [<ffffffff8128ec6f>] ? __clear_user+0x3f/0x70
  [<ffffffff8128ec51>] ? __clear_user+0x21/0x70
  [<ffffffff812263d6>] ? security_file_permission+0x16/0x20
  [<ffffffff81188f88>] vfs_write+0xb8/0x1a0
  [<ffffffff81189881>] sys_write+0x51/0x90
  [<ffffffff810e1e6e>] ? __audit_syscall_exit+0x25e/0x290
  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

I have seen dd in uninterruptible sleep (state D, as in the trace above) 
on b.g.o. There are also instances [1] where quota-anon-fd.t has run for 
more than 6000 seconds. This strongly points to the nfs deadlock.
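For reference, a quick way to check for this on a slave (a generic diagnostic sketch, not part of the test framework) is to list tasks stuck in uninterruptible sleep:

```shell
# Print the ps header plus any task in state D (uninterruptible sleep),
# which is the state dd is shown in above. A task that stays in D for
# minutes while a test runs is a strong hint of an I/O deadlock.
ps -eo pid,state,comm | awk 'NR == 1 || $2 == "D"'
```

If dd shows up here for the whole duration of the hang, the test is blocked in the kernel, not spinning in userspace.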


> only
> this one keeps failing for the past 2-3 days.

It is a function of system memory consumption and of what the OOM killer 
decides to kill. If the NFS server or a glusterfsd process gets killed, 
the test will fail. If the test can keep running until the system 
reclaims memory, it can possibly succeed.
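As a stopgap on the slaves, one could bias the OOM killer away from the server daemons so that a memory spike takes out the test's dd rather than glusterfsd or the gluster NFS process. This is a hypothetical tweak, not something the regression framework does today, and the process names are assumptions based on a typical gluster setup:

```shell
# Hypothetical: exempt gluster server processes from OOM selection.
# oom_score_adj ranges from -1000 (never kill) to 1000 (kill first);
# writing -1000 requires root. The names glusterfsd/glusterfs are
# assumed and would need checking against the actual deployment.
for name in glusterfsd glusterfs; do
    for pid in $(pgrep -x "$name"); do
        echo -1000 > "/proc/$pid/oom_score_adj"
    done
done
```

This only changes which process dies first; it does not address the underlying client-and-server-on-one-node deadlock.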

However, there could be other causes, and we need to root-cause them as 
well.


-Vijay

[1] http://build.gluster.org/job/regression/4783/console


