[Gluster-devel] Release 3.12: Glusto run status
Atin Mukherjee
amukherj at redhat.com
Wed Aug 30 00:33:22 UTC 2017
On Wed, 30 Aug 2017 at 00:23, Shwetha Panduranga <spandura at redhat.com>
wrote:
> Hi Shyam, we are already doing it. We wait for the rebalance status to be
> complete. We loop: we keep checking whether the status is complete, for 20
> minutes or so.
>
Are you saying that in this test the rebalance status command was executed
multiple times till it succeeded? If yes, then the test shouldn't have
failed. Can I get access to the complete set of logs?
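
For reference, the kind of retry loop I have in mind looks roughly like the
sketch below (pseudo-Python; 'run_cmd' and the output parsing are
illustrative placeholders, not the actual glustolibs API):

import time

def wait_for_rebalance_complete(run_cmd, volname, timeout=1200, interval=10):
    """Poll 'gluster volume rebalance <vol> status' until it no longer
    reports an in-progress or failed state, or until the timeout (in
    seconds) expires. run_cmd is assumed to run a shell command and
    return (return_code, stdout, stderr)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        ret, out, err = run_cmd("gluster volume rebalance %s status" % volname)
        # A non-zero return (e.g. a transient "Received commit RJT ..."
        # while a node's rebalance daemon is still coming up) is retried,
        # not treated as a test failure.
        if ret == 0 and "in progress" not in out and "failed" not in out:
            return True
        time.sleep(interval)
    return False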
> -Shwetha
>
> On Tue, Aug 29, 2017 at 7:04 PM, Shyam Ranganathan <srangana at redhat.com>
> wrote:
>
>> On 08/29/2017 09:31 AM, Atin Mukherjee wrote:
>>
>>>
>>>
>>> On Tue, Aug 29, 2017 at 4:13 AM, Shyam Ranganathan <srangana at redhat.com> wrote:
>>>
>>> Nigel, Shwetha,
>>>
>>> The latest Glusto run [a], which Nigel started after fixing the
>>> prior timeout issue, failed again (though much later in the run).
>>>
>>> I took a look at the logs and my analysis is here [b]
>>>
>>> @atin, @kaushal, @ppai can you take a look and see if the analysis
>>> is correct?
>>>
>>>
>>> I took a look at the logs and here is my theory:
>>>
>>> glusterd starts the rebalance daemon through the runner framework in
>>> nowait mode, which essentially means that even though glusterd reports
>>> success back to the CLI for the rebalance start, one of the nodes might
>>> take some additional time to start the rebalance process and establish
>>> the rpc connection. In this case we hit a race: while one of the nodes
>>> was still trying to start the rebalance process, a rebalance status
>>> command was triggered, which failed on that node because the rpc
>>> connection wasn't up yet, and the originator glusterd's commit op failed
>>> with the "Received commit RJT from uuid:
>>> 6f9524e6-9f9e-44aa-b2f4-393404adfd9d" error.
>>> To avoid such spurious timeout issues we normally check the status in a
>>> loop till a certain timeout. Isn't that the case in Glusto? If my
>>> analysis is correct, you shouldn't be seeing this failure on a second
>>> attempt, as it is a race.
>>>
>>
>> Thanks Atin.
>>
>> In this case there is no second check or timed check (a sleep, or an
>> EXPECT_WITHIN-like construct; see the rough sketch below).
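>>
>> Roughly what I mean by an EXPECT_WITHIN-like construct, as a
>> pseudo-Python sketch (the helper and the functions named in the usage
>> comments are placeholders, not existing Glusto library calls):
>>
>> import time
>>
>> def expect_within(timeout, interval, check, *args, **kwargs):
>>     """Retry check(*args, **kwargs) until it returns True or the
>>     timeout (seconds) expires, sleeping 'interval' seconds between
>>     attempts; mirrors EXPECT_WITHIN from the .t regression tests."""
>>     deadline = time.time() + timeout
>>     while time.time() < deadline:
>>         if check(*args, **kwargs):
>>             return True
>>         time.sleep(interval)
>>     return False
>>
>> # Usage, instead of a single status check right after 'rebalance start':
>> #   expect_within(120, 5, rebalance_status_is_complete, mnode, volname)
>> # and for the earlier brick-online failure, something like:
>> #   expect_within(60, 2, all_volume_processes_online, mnode, volname)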
>>
>> @Shwetha, can we fix up this test and give it another go?
>>
>>
>>>
>>> In short, glusterd got an error when checking the rebalance status
>>> from one of the nodes:
>>> "Received commit RJT from uuid: 6f9524e6-9f9e-44aa-b2f4-393404adfd9d"
>>>
>>> and the rebalance daemon on the node with that UUID was not really
>>> ready to serve requests when this was called, hence I am assuming
>>> this is what caused the error. But it needs a once-over by one of you folks.
>>>
>>> @Shwetha, can we add a further timeout between rebalance start and
>>> checking the status, just so that we avoid this timing issue on
>>> these nodes?
>>>
>>> Thanks,
>>> Shyam
>>>
>>> [a] glusto run:
>>> https://ci.centos.org/view/Gluster/job/gluster_glusto/377/
>>>
>>> [b] analysis of the failure:
>>> https://paste.fedoraproject.org/paste/mk6ynJ0B9AH6H9ncbyru5w
>>>
>>> On 08/25/2017 04:29 PM, Shyam Ranganathan wrote:
>>>
>>> Nigel was kind enough to kick off a glusto run on 3.12 head a
>>> couple of days back. The status can be seen here [1].
>>>
>>> The run failed, but it managed to get further than the Glusto run
>>> on master does (see [2]). Not that this is a consolation, but just
>>> stating the fact.
>>>
>>> The run [1] failed at,
>>> 17:05:57
>>>
>>> functional/bvt/test_cvt.py::TestGlusterHealSanity_dispersed_glusterfs::test_self_heal_when_io_in_progress
>>> FAILED
>>>
>>> The test case failed due to,
>>> 17:10:28 E AssertionError: ('Volume %s : All process are
>>> not online', 'testvol_dispersed')
>>>
>>> The test case can be seen here [3], and the reason for the failure
>>> is that Glusto did not wait long enough for the down brick to
>>> come up (it waited for 10 seconds, but the brick came up after
>>> about 12 seconds, i.e. within the same second as the check for it
>>> being up). The log snippets pointing to this problem are here [4].
>>> In short, there was no real bug or issue behind the failure as
>>> yet.
>>>
>>> Glusto as a gating factor for this release was desirable, but
>>> having got this far on 3.12 does help.
>>>
>>> @nigel, we could increase the timeout between bringing the brick
>>> up and checking that it is up, and then try another run. Let me
>>> know if that works, and what is needed from me to get this
>>> going.
>>>
>>> Shyam
>>>
>>> [1] Glusto 3.12 run:
>>> https://ci.centos.org/view/Gluster/job/gluster_glusto/365/
>>>
>>> [2] Glusto on master:
>>>
>>> https://ci.centos.org/view/Gluster/job/gluster_glusto/360/testReport/functional.bvt.test_cvt/
>>>
>>>
>>> [3] Failed test case:
>>>
>>> https://ci.centos.org/view/Gluster/job/gluster_glusto/365/testReport/functional.bvt.test_cvt/TestGlusterHealSanity_dispersed_glusterfs/test_self_heal_when_io_in_progress/
>>>
>>>
>>> [4] Log analysis pointing to the failed check:
>>> https://paste.fedoraproject.org/paste/znTPiFLrc2~vsWuoYRToZA
>>>
>>> "Releases are made better together"
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>>
>>>
> --
- Atin (atinm)