[Gluster-devel] spurious regression failures again! [bug-1112559.t]

Anders Blomdell anders.blomdell at control.lth.se
Tue Jul 22 14:19:59 UTC 2014


On 2014-07-22 15:12, Joseph Fernandes wrote:
> Hi All,
> 
> As with further investigation found the following,
> 
> 1) Was able to reproduce the issue without running the complete regression, just by running bug-1112559.t alone on slave30 (which had been rebooted and given a clean gluster setup).
>    This rules out any involvement of previous failures from other spurious errors like mgmt_v3-locks.t.
> 2) Added some debug messages, plus execution of netstat and ps -ef | grep gluster, whenever binding to a port fails (in rpc/rpc-transport/socket/src/socket.c), and found the following:
> 
>         The snapshot brick on the second node (127.1.1.2) always fails to acquire its port (e.g. 127.1.1.2:49155).
> 
>         Netstat output shows: 
>         tcp        0      0 127.1.1.2:49155             0.0.0.0:*                   LISTEN      3555/glusterfsd
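
(A rough sketch, in plain C, of the kind of debug hook described above -- the
helper name is made up and this is not the actual patch; the real change would
live in the bind-failure path of rpc/rpc-transport/socket/src/socket.c:)

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Best-effort diagnostics when a brick cannot take its port; output
     * ends up on the brick process's stderr. */
    static void
    dump_port_owners_on_bind_failure (int bind_errno, int port)
    {
            fprintf (stderr, "bind() on port %d failed: %s\n",
                     port, strerror (bind_errno));
            if (bind_errno == EADDRINUSE) {
                    /* who is listening, and which gluster processes exist */
                    (void) system ("netstat -ntlp 2>/dev/null");
                    (void) system ("ps -ef | grep '[g]luster'");
            }
    }
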
Could this be the time to propose that gluster understand port reservation à la systemd (LISTEN_FDS),
and to make the test harness ensure that randomly chosen ports do not collide with the set of
expected ports? That would also be beneficial when starting from systemd.
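
For concreteness, here is a minimal sketch of the LISTEN_PID/LISTEN_FDS
convention in plain C (no libsystemd) -- purely an illustration of the
mechanism, not a proposed gluster patch:

    #include <stdlib.h>
    #include <unistd.h>

    #define SD_LISTEN_FDS_START 3

    /* Return the first inherited listening fd, or -1 if the manager
     * (systemd, or a test harness doing the same dance) passed nothing. */
    static int
    get_passed_listen_fd (void)
    {
            const char *pid = getenv ("LISTEN_PID");
            const char *fds = getenv ("LISTEN_FDS");

            if (!pid || !fds)
                    return -1;                  /* nothing was passed in */
            if ((pid_t) atol (pid) != getpid ())
                    return -1;                  /* meant for another process */
            if (atoi (fds) < 1)
                    return -1;

            return SD_LISTEN_FDS_START;         /* fd 3 is the first socket */
    }

With something like this in the transport layer, the socket would already be
bound by whoever starts the daemon, so a brick could never lose a race for its
port.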


> 
>         and the process that is holding the port 49155 is 
> 
>         root      3555     1  0 12:38 ?        00:00:00 /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p /d/backends/3/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid -S /var/run/ff772f1ff85950660f389b0ed43ba2b7.socket --brick-name /d/backends/2/patchy_snap_mnt -l /usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log --xlator-option *-posix.glusterd-uuid=3af134ec-5552-440f-ad24-1811308ca3a8 --brick-port 49155 --xlator-option patchy-server.listen-port=49155
> 
>         Please note that even though it says 127.1.1.2, it shows the glusterd-uuid of the 3rd node, which was being probed when the snapshot was created: "3af134ec-5552-440f-ad24-1811308ca3a8"
> 
>         To clarify, there is already a volume brick on 127.1.1.2:
> 
>         root      3446     1  0 12:38 ?        00:00:00 /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p /d/backends/2/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid -S /var/run/e667c69aa7a1481c7bd567b917cd1b05.socket --brick-name /d/backends/2/patchy_snap_mnt -l /usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f --brick-port 49153 --xlator-option patchy-server.listen-port=49153
> 
>         The above brick process (3555) is not visible before the snap creation or after the failure to start the snap brick on 127.1.1.2.
>         This means the process was spawned and died during the creation of the snapshot and the probe of the 3rd node (which happen simultaneously).
> 
>         In addition to this process, we can see multiple snap brick processes for the second brick on the second node, which are not seen after the failure to start the snap brick on 127.1.1.2:
> 
>         root      3582     1  0 12:38 ?        00:00:00 /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id /snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2 -p /d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name /var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l /usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f --brick-port 49155 --xlator-option 66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155
>         root      3583  3582  0 12:38 ?        00:00:00 /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id /snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2 -p /d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name /var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l /usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f --brick-port 49155 --xlator-option 66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155
> 
> 
> 
> This looks like the second node tries to start the snap brick
> 1) with the wrong brickinfo and peerinfo (process 3555)
> 2) multiple times with the correct brickinfo (processes 3582, 3583)
3583 is a subprocess of 3582, so it's only one invocation.

> 3) This issue is not seen when snapshot creation and peer probe are NOT done simultaneously.
> 
> Will continue on the investigation and will keep you posted.
> 
> 
> Regards,
> Joe
> 
> 
> 
> 
> ----- Original Message -----
> From: "Joseph Fernandes" <josferna at redhat.com>
> To: "Avra Sengupta" <asengupt at redhat.com>
> Cc: "Pranith Kumar Karampuri" <pkarampu at redhat.com>, "Gluster Devel" <gluster-devel at gluster.org>, "Varun Shastry" <vshastry at redhat.com>, "Justin Clift" <justin at gluster.org>
> Sent: Thursday, July 17, 2014 10:58:14 AM
> Subject: Re: [Gluster-devel] spurious regression failures again!
> 
> Hi Avra,
> 
> Just clarifying things here,
> 1) When testing with the setup provided by Justin, the only place I found bug-1112559.t failing was after mgmt_v3-locks.t had failed in the previous regression run. The mail attached to the previous mail was just an OBSERVATION, NOT an INFERENCE that the failure of mgmt_v3-locks.t was the root cause of bug-1112559.t. I am NOT jumping the gun or drawing any conclusion here; it's just an OBSERVATION. And thanks for the clarification on why mgmt_v3-locks.t is failing.
> 
> 2) I agree with you that the cleanup script needs to kill all gluster* processes, and it's also true that the port range used by gluster for bricks is unique.
> But bug-1112559.t fails only because a port is unavailable when starting the snap brick. This suggests that some process (gluster or non-gluster)
> is still using the port.
> 
> 3) Finally, it is not true that bug-1112559.t always fails on its own: looking into the links you provided, there are cases with other, earlier test case failures on the same testing machine (slave26). I am not pointing to those failures as the root cause of the bug-1112559.t failure;
> as stated earlier, it is a notable OBSERVATION (keeping in mind point 2 about ports and cleanup).
> 
> I have done nearly 30 runs on slave30 and bug-1112559.t failed only once (as stated in point 1). I am continuing with more runs. The only problem is that the bug-1112559.t failure is spurious and there is no deterministic way of reproducing it.
> 
> Will keep everyone posted on the results.
> 
> Regards,
> Joe
> 
> 
> 
> ----- Original Message -----
> From: "Avra Sengupta" <asengupt at redhat.com>
> To: "Joseph Fernandes" <josferna at redhat.com>, "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Varun Shastry" <vshastry at redhat.com>, "Justin Clift" <justin at gluster.org>
> Sent: Wednesday, July 16, 2014 1:03:21 PM
> Subject: Re: [Gluster-devel] spurious regression failures again!
> 
> Joseph,
> 
> I am not sure I understand how this is affecting the spurious failure of 
> bug-1112559.t. As per the mail you have attached, and according to your 
> analysis,  bug-1112559.t fails because a cleanup hasn't happened 
> properly after a previous test-case failed and in your case there was a 
> crash as well.
> 
> Now out of all the times bug-1112559.t has failed, most of the time it's 
> the only test case failing and there isn't any crash. Below are the 
> regression runs that Pranith had sent for the same.
> 
> http://build.gluster.org/job/rackspace-regression-2GB/541/consoleFull
> 
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/173/consoleFull
> 
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/172/consoleFull
> 
> http://build.gluster.org/job/rackspace-regression-2GB/543/console
> 
> In all of the above bug-1112559.t is the only test case that fails and 
> there is no crash.
> 
> So what I fail to understand is: if this particular testcase fails 
> independently as well as alongside other testcases, how can we conclude 
> that some other failing testcase is not cleaning up properly and that this 
> is the reason bug-1112559.t fails?
> 
> mgmt_v3-locks.t fails because glusterd takes more time to register a 
> node going down, and hence peer status doesn't return what the 
> testcase expects. It's a race. Like every other testcase, it ends with a 
> cleanup routine that kills all gluster and glusterfsd processes, which 
> might be using any brick ports. So could you please explain how, or which, 
> process still uses the brick ports that the snap bricks are trying to use, 
> leading to the failure of bug-1112559.t?
> 
> Regards,
> Avra
> 
> On 07/15/2014 09:57 PM, Joseph Fernandes wrote:
>> Just pointing out ,
>>
>> 2) tests/basic/mgmt_v3-locks.t - Author: Avra
>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull
>>
>> This is a similar kind of error to the one I saw in my testing of the spurious failure tests/bugs/bug-1112559.t.
>>
>> Please refer the attached mail.
>>
>> Regards,
>> Joe
>>
>>
>>
>> ----- Original Message -----
>> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>> To: "Joseph Fernandes" <josferna at redhat.com>
>> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Varun Shastry" <vshastry at redhat.com>
>> Sent: Tuesday, July 15, 2014 9:34:26 PM
>> Subject: Re: [Gluster-devel] spurious regression failures again!
>>
>>
>> On 07/15/2014 09:24 PM, Joseph Fernandes wrote:
>>> Hi Pranith,
>>>
>>> Could you please share the link of the console output of the failures.
>> Added them inline. Thanks for reminding :-)
>>
>> Pranith
>>> Regards,
>>> Joe
>>>
>>> ----- Original Message -----
>>> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>>> To: "Gluster Devel" <gluster-devel at gluster.org>, "Varun Shastry" <vshastry at redhat.com>
>>> Sent: Tuesday, July 15, 2014 8:52:44 PM
>>> Subject: [Gluster-devel] spurious regression failures again!
>>>
>>> hi,
>>>        We have 4 tests failing once in a while causing problems:
>>> 1) tests/bugs/bug-1087198.t - Author: Varun
>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/379/consoleFull
>>> 2) tests/basic/mgmt_v3-locks.t - Author: Avra
>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/375/consoleFull
>>> 3) tests/basic/fops-sanity.t - Author: Pranith
>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/383/consoleFull
>>> Please take a look at them and post updates.
>>>
>>> Pranith
/Anders


-- 
Anders Blomdell                  Email: anders.blomdell at control.lth.se
Department of Automatic Control
Lund University                  Phone:    +46 46 222 4625
P.O. Box 118                     Fax:      +46 46 138118
SE-221 00 Lund, Sweden


