[Gluster-devel] Problems with ec/nfs.t in regression tests

Thu Feb 12 18:09:51 UTC 2015

On 02/12/2015 11:34 PM, Pranith Kumar Karampuri wrote:
>
> On 02/12/2015 08:15 PM, Xavier Hernandez wrote:
>> I've done some more investigation and the problem seems worse.
>>
>> It seems that NFS sends a huge number of requests without waiting for 
>> answers (I've seen more than 1400 requests in flight). Many factors 
>> probably influence the load this causes, and ec could be one of them, 
>> but the problem is not exclusive to ec. I've repeated the test using 
>> replica 3 and replica 2 volumes and the problem still happens.
>>
>> The test basically writes a file to an NFS mount using 'dd'. The file 
>> has a size of 1GB. With a smaller file, the test passes successfully.
> Using NFS client and gluster NFS server on same machine with BIG file 
> dd operations is known to cause hangs. anon-fd-quota.t used to give 
> similar problems so we changed the test to not involve NFS mounts.
I don't recollect the exact scenario. Avati found the memory-allocation 
deadlock in 2010, when I had just joined gluster. Raghavendra Bhat 
raised this bug then. I've CCed him on the thread in case he knows the 
exact scenario.

Pranith
>
> Pranith
>>
>> One important thing to note is that I'm not using powerful servers (a 
>> dual core Intel Atom), but this problem shouldn't happen anyway. It 
>> can even happen on more powerful servers if they are busy doing other 
>> things (maybe this is what's happening on jenkins' slaves).
>>
>> I think that this causes some NFS requests to time out. This can be 
>> seen in /var/log/messages (there are many of these messages):
>>
>> Feb 12 15:18:45 celler01 kernel: nfs: server gf01.datalab.es not 
>> responding, timed out
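
A minimal sketch for counting these client-side timeouts, assuming the
kernel messages land in /var/log/messages as in the excerpt above. The
helper name count_nfs_timeouts is hypothetical and only for illustration.

```shell
# Hypothetical helper: count "not responding, timed out" messages logged
# by the kernel NFS client. The log path defaults to /var/log/messages,
# which is an assumption taken from the excerpt above.
count_nfs_timeouts() {
    log="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'nfs: server .* not responding, timed out' "$log"
}
```

Running it before and after the test gives a rough idea of how many
requests timed out during the run.
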
>>
>> nfs log also has many errors:
>>
>> [2015-02-12 14:18:45.132905] E [rpcsvc.c:1257:rpcsvc_submit_generic] 
>> 0-rpc-service: failed to submit message (XID: 0x7be78dbe, Program: 
>> NFS3, ProgVers: 3, Proc: 7) to rpc
>> -transport (socket.nfs-server)
>> [2015-02-12 14:18:45.133009] E [nfs3.c:565:nfs3svc_submit_reply] 
>> 0-nfs-nfsv3: Reply submission failed
>>
>> Additionally, this causes disconnections from NFS that are not 
>> correctly handled, leaving a thread stuck in an infinite loop (I 
>> haven't analyzed this problem deeply, but it looks like an attempt to 
>> use an already disconnected socket). After a while, I get this error 
>> in the nfs log:
>>
>> [2015-02-12 14:20:19.545429] C 
>> [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired] 0-patchy-client-0: 
>> server 192.168.200.61:49152 has not responded in the last 42 seconds, 
>> disconnecting.
>>
>> The console executing the test shows this (nfs.t is creating a 
>> replica 3 instead of a dispersed volume):
>>
>> # ./run-tests.sh tests/basic/ec/nfs.t
>>
>> ... GlusterFS Test Framework ...
>>
>> Running tests in file ./tests/basic/ec/nfs.t
>> [14:12:52] ./tests/basic/ec/nfs.t .. 8/10 dd: error writing 
>> ‘/mnt/nfs/0/test’: Input/output error
>> [14:12:52] ./tests/basic/ec/nfs.t .. 9/10
>> not ok 9
>> [14:12:52] ./tests/basic/ec/nfs.t .. Failed 1/10 subtests
>> [14:27:41]
>>
>> Test Summary Report
>> -------------------
>> ./tests/basic/ec/nfs.t (Wstat: 0 Tests: 10 Failed: 1)
>> Failed test: 9
>> Files=1, Tests=10, 889 wallclock secs ( 0.13 usr 0.02 sys + 1.29 cusr 
>> 3.45 csys = 4.89 CPU)
>> Result: FAIL
>> Failed tests ./tests/basic/ec/nfs.t
>>
>> Note that the test takes almost 15 minutes to complete.
>>
>> Is there any way to limit the number of requests NFS sends without 
>> having received an answer?
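
On the Linux NFS client, the number of RPC requests in flight per TCP
transport is bounded by the sunrpc slot table. The sketch below is one
possible way to inspect and cap it. Whether this helps with the failure
discussed in this thread is an assumption, the default value 16 is
arbitrary, and the helper name cap_rpc_slots is hypothetical. The new
limit only applies to transports created after the change, so the NFS
export would need to be remounted.

```shell
# Hypothetical helper: show the sunrpc TCP slot table settings, then
# lower the maximum number of concurrent RPC requests per transport.
# Needs root and a loaded sunrpc module; 16 is an arbitrary example value.
cap_rpc_slots() {
    max="${1:-16}"
    # Print the current initial and maximum slot counts.
    sysctl sunrpc.tcp_slot_table_entries sunrpc.tcp_max_slot_table_entries || return 1
    # Lower the ceiling for transports created from now on.
    sysctl -w "sunrpc.tcp_max_slot_table_entries=$max"
}
```
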
>>
>> Xavi
>>
>> On 02/11/2015 04:20 PM, Shyam wrote:
>>> On 02/11/2015 09:40 AM, Xavier Hernandez wrote:
>>>> Hi,
>>>>
>>>> it seems that there are some failures in ec/nfs.t test on regression
>>>> tests. Doing some investigation I've found that before applying the
>>>> multi-threaded patch (commit 5e25569e) the problem does not seem to
>>>> happen.
>>>
>>> This has an interesting history of failures: on the regression runs for
>>> the MT epoll patch, this test (i.e. ec/nfs.t) did not fail (others did,
>>> but not nfs.t).
>>>
>>> The patch that allows configuration of MT epoll is where this started
>>> failing around Feb 5th (but later passed). (see patchset 7 failures on,
>>> http://review.gluster.org/#/c/9488/ )
>>>
>>> I state the above, as it may help narrowing down the changes in EC
>>> (maybe) that could have caused it.
>>>
>>> Also, in the latter commit there was an error in configuring the number
>>> of threads, so all regression runs would have run with a single epoll
>>> thread (the MT epoll patch had this hard coded, so it would have run
>>> with 2 threads, but it did not show the issue (patch:
>>> http://review.gluster.org/#/c/3842/)).
>>>
>>> Again I state the above, as this should not be exposing a
>>> race/bug/problem due to the multi threaded nature of epoll, but of
>>> course needs investigation.
>>>
>>>>
>>>> I'm not sure if this patch is the cause or it has revealed some bug in
>>>> ec or any other xlator.
>>>
>>> I guess we can reproduce this issue? If so I would try setting
>>> client.event-threads on master branch to 1, restarting the volume and
>>> then running the test (as a part of the test itself maybe) to eliminate
>>> the possibility that MT epoll is causing it.
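
The suggestion above could be tried roughly as follows. This is a sketch
only: the helper name single_event_thread is hypothetical, and the
default volume name patchy is an assumption taken from the log lines
quoted earlier in this thread.

```shell
# Hypothetical helper: force a single client epoll thread on a volume,
# then restart the volume so the setting is in effect for the next run.
single_event_thread() {
    vol="${1:-patchy}"
    gluster volume set "$vol" client.event-threads 1 || return 1
    # --mode=script skips the interactive confirmation on stop.
    gluster --mode=script volume stop "$vol" || return 1
    gluster volume start "$vol"
}
```

After this, ./run-tests.sh tests/basic/ec/nfs.t can be run again to see
whether the failure still shows up with one event thread.
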
>>>
>>> I doubt that MT epoll is causing it, as the runs failed on
>>> http://review.gluster.org/#/c/9488/ (the configuration patch), which had
>>> the thread count as 1 due to a bug in that code.
>>>
>>>>
>>>> I can try to identify it (any help will be appreciated), but it may
>>>> take some time. Would it be better to remove the test in the meantime?
>>>
>>> I am checking if this is reproducible on my machine, so that I can
>>> possibly see what is going wrong.
>>>
>>> Shyam
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-devel