[Gluster-devel] Spurious failures because of nfs and snapshots

Vijaikumar M vmallika at redhat.com
Wed May 21 06:53:02 UTC 2014


 From the log
http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a17%3a10%3a51.tgz
it looks like glusterd was hung:

Glusterd log:

  5305 [2014-05-20 20:08:55.040665] E [glusterd-snapshot.c:3805:glusterd_add_brick_to_snap_volume] 0-management: Unable to fetch snap device (vol1.brick_snapdevice0). Leaving empty
  5306 [2014-05-20 20:08:55.649146] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
  5307 [2014-05-20 20:08:55.663181] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
  5308 [2014-05-20 20:16:55.541197] W [glusterfsd.c:1182:cleanup_and_exit] (--> 0-: received signum (15), shutting down

Glusterd was hung when executing the testcase ./tests/bugs/bug-1090042.t.

Cli log:

  72649 [2014-05-20 20:12:51.960765] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
  72650 [2014-05-20 20:12:51.960850] T [socket.c:2689:socket_connect] (-->/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) [0x7ff8b6609994] (-->/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) [0x7ff8b5d3305b] (-->/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) [0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already connected
  72651 [2014-05-20 20:12:52.960943] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
  72652 [2014-05-20 20:12:52.960999] T [socket.c:2697:socket_connect] 0-glusterfs: connecting 0x1e0fcc0, state=0 gen=0 sock=-1
  72653 [2014-05-20 20:12:52.961038] W [dict.c:1059:data_to_str] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) [0x7ff8ad9e95f3] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0xf1) [0x7ff8ad9ec7d0]))) 0-dict: data is NULL
  72654 [2014-05-20 20:12:52.961070] W [dict.c:1059:data_to_str] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) [0x7ff8ad9e95f3] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] (-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0x100) [0x7ff8ad9ec7df]))) 0-dict: data is NULL
  72655 [2014-05-20 20:12:52.961079] E [name.c:140:client_fill_address_family] 0-glusterfs: transport.address-family not specified. Could not guess default value from (remote-host:(null) or transport.unix.connect-path:(null)) options
  72656 [2014-05-20 20:12:54.961273] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
  72657 [2014-05-20 20:12:54.961404] T [socket.c:2689:socket_connect] (-->/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) [0x7ff8b6609994] (-->/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) [0x7ff8b5d3305b] (-->/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) [0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already connected
  72658 [2014-05-20 20:12:55.120645] D [cli-cmd.c:384:cli_cmd_submit] 0-cli: Returning 110
  72659 [2014-05-20 20:12:55.120723] D [cli-rpc-ops.c:8716:gf_cli_snapshot] 0-cli: Returning 110


Now we need to find out why glusterd hung.
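One way to find out where it is stuck is to grab a stack dump of the hung daemon. A minimal sketch, assuming pidof plus pstack or gdb are available on the build machine; the dump_stacks helper name is illustrative, not part of the test harness:

```shell
#!/bin/sh
# Illustrative helper: dump the stacks of a (possibly hung) daemon so we
# can see which call or lock it is stuck in. Falls back from pstack to gdb.
dump_stacks() {
    pid=$(pidof "$1" 2>/dev/null | awk '{print $1}')
    if [ -z "$pid" ]; then
        echo "$1 is not running"
        return 1
    fi
    if command -v pstack >/dev/null 2>&1; then
        pstack "$pid"
    else
        gdb -batch -p "$pid" -ex 'thread apply all bt' 2>/dev/null
    fi
}

# On the build slave one would run: dump_stacks glusterd
```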


Thanks,
Vijay



On Wednesday 21 May 2014 06:46 AM, Pranith Kumar Karampuri wrote:
> Hey,
>      It seems that even after this fix was merged, the regression tests are still failing for the same script. You can check the logs at http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz
>
> Relevant logs:
> [2014-05-20 20:17:07.026045]  : volume create patchy build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : SUCCESS
> [2014-05-20 20:17:08.030673]  : volume start patchy : SUCCESS
> [2014-05-20 20:17:08.279148]  : volume barrier patchy enable : SUCCESS
> [2014-05-20 20:17:08.476785]  : volume barrier patchy enable : FAILED : Failed to reconfigure barrier.
> [2014-05-20 20:17:08.727429]  : volume barrier patchy disable : SUCCESS
> [2014-05-20 20:17:08.926995]  : volume barrier patchy disable : FAILED : Failed to reconfigure barrier.
>
> Pranith
>
> ----- Original Message -----
>> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>> To: "Gluster Devel" <gluster-devel at gluster.org>
>> Cc: "Joseph Fernandes" <josferna at redhat.com>, "Vijaikumar M" <vmallika at redhat.com>
>> Sent: Tuesday, May 20, 2014 3:41:11 PM
>> Subject: Re: Spurious failures because of nfs and snapshots
>>
>> hi,
>>      Please resubmit the patches on top of http://review.gluster.com/#/c/7753
>>      to prevent frequent regression failures.
>>
>> Pranith
>> ----- Original Message -----
>>> From: "Vijaikumar M" <vmallika at redhat.com>
>>> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>>> Cc: "Joseph Fernandes" <josferna at redhat.com>, "Gluster Devel"
>>> <gluster-devel at gluster.org>
>>> Sent: Monday, May 19, 2014 2:40:47 PM
>>> Subject: Re: Spurious failures because of nfs and snapshots
>>>
>>> Brick disconnected with ping-timeout:
>>>
>>> Here is the log message:
>>> [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a91013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3 -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3fbb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l /build/install/var/log/glusterfs/bricks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba7b4b0 --brick-port 49164 --xlator-option 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164)
>>>     2 [2014-05-19 04:29:38.141118] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting ping-timeout to 30secs
>>>     3 [2014-05-19 04:30:09.139521] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting.
>>>
>>>
>>>
>>> Patch 'http://review.gluster.org/#/c/7753/' will fix the problem: the
>>> ping-timer will be disabled by default for all rpc connections except
>>> glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec).
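For completeness: on a live cluster the client-side ping-timeout can also be tuned per volume through the real `network.ping-timeout` volume option. A hedged sketch; the wrapper function and its DRY_RUN switch are illustrative so the command can be shown without a running cluster:

```shell
# Illustrative wrapper around the real CLI command for tuning the
# client-side ping-timeout of a volume. DRY_RUN=1 just echoes the
# command instead of executing it.
set_ping_timeout() {
    vol=$1
    secs=$2
    cmd="gluster volume set $vol network.ping-timeout $secs"
    if [ -n "$DRY_RUN" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# e.g. DRY_RUN=1 set_ping_timeout patchy 42
```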
>>>
>>>
>>> Thanks,
>>> Vijay
>>>
>>>
>>> On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote:
>>>> The latest build failure also has the same issue:
>>>> Download it from here:
>>>> http://build.gluster.org:443/logs/glusterfs-logs-20140518%3a22%3a27%3a31.tgz
>>>>
>>>> Pranith
>>>>
>>>> ----- Original Message -----
>>>>> From: "Vijaikumar M" <vmallika at redhat.com>
>>>>> To: "Joseph Fernandes" <josferna at redhat.com>
>>>>> Cc: "Pranith Kumar Karampuri" <pkarampu at redhat.com>, "Gluster Devel"
>>>>> <gluster-devel at gluster.org>
>>>>> Sent: Monday, 19 May, 2014 11:41:28 AM
>>>>> Subject: Re: Spurious failures because of nfs and snapshots
>>>>>
>>>>> Hi Joseph,
>>>>>
>>>>> In the log mentioned below, it says ping-timeout is set to the default
>>>>> value of 30sec. I think the issue is different.
>>>>> Can you please point me to the logs where you were able to re-create
>>>>> the problem?
>>>>>
>>>>> Thanks,
>>>>> Vijay
>>>>>
>>>>>
>>>>>
>>>>> On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote:
>>>>>> hi Vijai, Joseph,
>>>>>>        In 2 of the last 3 build failures,
>>>>>>        http://build.gluster.org/job/regression/4479/console,
>>>>>>        http://build.gluster.org/job/regression/4478/console, this
>>>>>>        test (tests/bugs/bug-1090042.t) failed. Do you guys think it is
>>>>>>        better
>>>>>>        to revert this test until the fix is available? Please send a
>>>>>>        patch
>>>>>>        to revert the test case if you guys feel so. You can re-submit it
>>>>>>        along with the fix to the bug mentioned by Joseph.
>>>>>>
>>>>>> Pranith.
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Joseph Fernandes" <josferna at redhat.com>
>>>>>>> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>>>> Sent: Friday, 16 May, 2014 5:13:57 PM
>>>>>>> Subject: Re: Spurious failures because of nfs and snapshots
>>>>>>>
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> tests/bugs/bug-1090042.t :
>>>>>>>
>>>>>>> I was able to reproduce the issue, i.e. when this test is run in a loop:
>>>>>>>
>>>>>>> for i in {1..135}; do ./bugs/bug-1090042.t; done
>>>>>>>
>>>>>>> When I checked the logs:
>>>>>>> [2014-05-16 10:49:49.003978] I
>>>>>>> [rpc-clnt.c:973:rpc_clnt_connection_init]
>>>>>>> 0-management: setting frame-timeout to 600
>>>>>>> [2014-05-16 10:49:49.004035] I
>>>>>>> [rpc-clnt.c:988:rpc_clnt_connection_init]
>>>>>>> 0-management: defaulting ping-timeout to 30secs
>>>>>>> [2014-05-16 10:49:49.004303] I
>>>>>>> [rpc-clnt.c:973:rpc_clnt_connection_init]
>>>>>>> 0-management: setting frame-timeout to 600
>>>>>>> [2014-05-16 10:49:49.004340] I
>>>>>>> [rpc-clnt.c:988:rpc_clnt_connection_init]
>>>>>>> 0-management: defaulting ping-timeout to 30secs
>>>>>>>
>>>>>>> The issue is with ping-timeout and is tracked under the bug
>>>>>>>
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1096729
>>>>>>>
>>>>>>>
>>>>>>> The workaround is mentioned in
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Joe
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>>>>>>> To: "Gluster Devel" <gluster-devel at gluster.org>
>>>>>>> Cc: "Joseph Fernandes" <josferna at redhat.com>
>>>>>>> Sent: Friday, May 16, 2014 6:19:54 AM
>>>>>>> Subject: Spurious failures because of nfs and snapshots
>>>>>>>
>>>>>>> hi,
>>>>>>>        The latest build I fired for review.gluster.com/7766
>>>>>>>        (http://build.gluster.org/job/regression/4443/console) failed
>>>>>>>        because of a spurious failure. The script doesn't wait for the
>>>>>>>        nfs export to be available. I fixed that, but interestingly I
>>>>>>>        found quite a few scripts with the same problem. Some of the
>>>>>>>        scripts rely on 'sleep 5', which could also lead to spurious
>>>>>>>        failures if the export is not available in 5 seconds. We found
>>>>>>>        that waiting for 20 seconds is better, but 'sleep 20' would
>>>>>>>        unnecessarily delay the build execution. So if you guys are
>>>>>>>        going to write any scripts which have to do nfs mounts, please
>>>>>>>        do it the following way:
>>>>>>>
>>>>>>> EXPECT_WITHIN 20 "1" is_nfs_export_available;
>>>>>>> TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;
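For scripts outside the harness, the same retry-until-timeout idea can be sketched as a plain shell loop. The `wait_for` helper below is illustrative, not the harness's real EXPECT_WITHIN implementation:

```shell
# Illustrative retry loop in the spirit of EXPECT_WITHIN: poll a command
# once per second until it prints the expected value or the timeout expires.
wait_for() {
    timeout=$1
    expected=$2
    shift 2
    i=0
    while [ "$i" -lt "$timeout" ]; do
        if [ "$("$@")" = "$expected" ]; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# e.g. wait_for 20 "1" is_nfs_export_available && mount -t nfs -o vers=3 $H0:/$V0 $N0
```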
>>>>>>>
>>>>>>> Please review http://review.gluster.com/7773 :-)
>>>>>>>
>>>>>>> I saw one more spurious failure in a snapshot-related script,
>>>>>>> tests/bugs/bug-1090042.t, on the next build fired by Niels.
>>>>>>> Joseph (CCed) is debugging it. He agreed to reply with what he finds
>>>>>>> and share it with us so that we won't introduce similar bugs in the
>>>>>>> future.
>>>>>>>
>>>>>>> I encourage you guys to share what you fix to prevent spurious
>>>>>>> failures in the future.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Pranith
>>>>>>>
>>>
