[Gluster-devel] Release 3.12: Glusto run status
Atin Mukherjee
amukherj at redhat.com
Wed Aug 30 00:33:22 UTC 2017
On Wed, 30 Aug 2017 at 00:23, Shwetha Panduranga <spandura at redhat.com>
wrote:
> Hi Shyam, we are already doing it. We wait for the rebalance status to be
> complete. We loop: we keep checking whether the status is complete, for 20
> minutes or so.
>
Are you saying that in this test the rebalance status command was executed
multiple times till it succeeded? If yes, then the test shouldn't have
failed. Can I get access to the complete set of logs?
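
For reference, the kind of retry loop I have in mind looks roughly like the
sketch below (pseudo-Python; 'run_cmd' and the output parsing are
illustrative placeholders, not the actual glustolibs API):

import time

def wait_for_rebalance_complete(run_cmd, volname, timeout=1200, interval=10):
    """Poll 'gluster volume rebalance <vol> status' until it no longer
    reports an in-progress or failed state, or until the timeout (in
    seconds) expires. run_cmd is assumed to run a shell command and
    return (return_code, stdout, stderr)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        ret, out, err = run_cmd("gluster volume rebalance %s status" % volname)
        # A non-zero return (e.g. a transient "Received commit RJT ..."
        # while a node's rebalance daemon is still coming up) is retried,
        # not treated as a test failure.
        if ret == 0 and "in progress" not in out and "failed" not in out:
            return True
        time.sleep(interval)
    return False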
> -Shwetha
>
> On Tue, Aug 29, 2017 at 7:04 PM, Shyam Ranganathan <srangana at redhat.com>
> wrote:
>
>> On 08/29/2017 09:31 AM, Atin Mukherjee wrote:
>>
>>>
>>>
>>> On Tue, Aug 29, 2017 at 4:13 AM, Shyam Ranganathan <srangana at redhat.com> wrote:
>>>
>>> Nigel, Shwetha,
>>>
>>> The latest Glusto run [a], which Nigel started after fixing the
>>> prior timeout issue, failed again (though much later in the run).
>>>
>>> I took a look at the logs and my analysis is here [b]
>>>
>>> @atin, @kaushal, @ppai can you take a look and see if the analysis
>>> is correct?
>>>
>>>
>>> I took a look at the logs and here is my theory:
>>>
>>> glusterd starts the rebalance daemon through the runner framework in
>>> nowait mode, which essentially means that even though glusterd reports
>>> success back to the CLI for the rebalance start, one of the nodes might
>>> take some additional time to start the rebalance process and establish
>>> the rpc connection. In this case we hit a race: while one of the nodes
>>> was still trying to start the rebalance process, a rebalance status
>>> command was triggered, which failed on that node because the rpc
>>> connection wasn't up yet, and the originator glusterd's commit op failed
>>> with the "Received commit RJT from uuid:
>>> 6f9524e6-9f9e-44aa-b2f4-393404adfd9d" error.
>>> To avoid such spurious timeout issues we normally check the status in a
>>> loop till a certain timeout. Isn't that the case in Glusto? If my
>>> analysis is correct, you shouldn't be seeing this failure on a second
>>> attempt, as it is a race.
>>>
>>
>> Thanks Atin.
>>
>> In this case there is no second check or timed check (a sleep, or an
>> EXPECT_WITHIN-like construct; see the rough sketch below).
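>>
>> Roughly what I mean by an EXPECT_WITHIN-like construct, as a
>> pseudo-Python sketch (the helper and the functions named in the usage
>> comments are placeholders, not existing Glusto library calls):
>>
>> import time
>>
>> def expect_within(timeout, interval, check, *args, **kwargs):
>>     """Retry check(*args, **kwargs) until it returns True or the
>>     timeout (seconds) expires, sleeping 'interval' seconds between
>>     attempts; mirrors EXPECT_WITHIN from the .t regression tests."""
>>     deadline = time.time() + timeout
>>     while time.time() < deadline:
>>         if check(*args, **kwargs):
>>             return True
>>         time.sleep(interval)
>>     return False
>>
>> # Usage, instead of a single status check right after 'rebalance start':
>> #   expect_within(120, 5, rebalance_status_is_complete, mnode, volname)
>> # and for the earlier brick-online failure, something like:
>> #   expect_within(60, 2, all_volume_processes_online, mnode, volname)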
>>
>> @Shwetha, can we fix up this test and give it another go?
>>
>>
>>>
>>> In short, glusterd got an error when checking the rebalance status
>>> from one of the nodes:
>>> "Received commit RJT from uuid: 6f9524e6-9f9e-44aa-b2f4-393404adfd9d"
>>>
>>> and the rebalance daemon on the node with that UUID was not really
>>> ready to serve requests when this was called, hence I am assuming
>>> this is what caused the error. But it needs a once-over by one of you folks.
>>>
>>> @Shwetha, can we add a further timeout between rebalance start and
>>> checking the status, just so that we avoid this timing issue on
>>> these nodes?
>>>
>>> Thanks,
>>> Shyam
>>>
>>> [a] glusto run:
>>> https://ci.centos.org/view/Gluster/job/gluster_glusto/377/
>>>
>>> [b] analysis of the failure:
>>> https://paste.fedoraproject.org/paste/mk6ynJ0B9AH6H9ncbyru5w
>>>
>>> On 08/25/2017 04:29 PM, Shyam Ranganathan wrote:
>>>
>>> Nigel was kind enough to kick off a glusto run on 3.12 head a
>>> couple of days back. The status can be seen here [1].
>>>
>>> The run failed, but it managed to get further than the Glusto run
>>> on master does (see [2]). Not that this is a consolation, but just
>>> stating the fact.
>>>
>>> The run [1] failed at,
>>> 17:05:57
>>>
>>> functional/bvt/test_cvt.py::TestGlusterHealSanity_dispersed_glusterfs::test_self_heal_when_io_in_progress
>>> FAILED
>>>
>>> The test case failed due to,
>>> 17:10:28 E AssertionError: ('Volume %s : All process are
>>> not online', 'testvol_dispersed')
>>>
>>> The test case can be seen here [3], and the reason for the failure
>>> is that Glusto did not wait long enough for the down brick to
>>> come up (it waited for 10 seconds, but the brick came up after
>>> about 12 seconds, i.e. within the same second as the check for it
>>> being up). The log snippets pointing to this problem are here [4].
>>> In short, there was no real bug or issue behind the failure as
>>> yet.
>>>
>>> Glusto as a gating factor for this release was desirable, but
>>> having got this far on 3.12 does help.
>>>
>>> @nigel, we could increase the timeout between bringing the brick
>>> up and checking that it is up, and then try another run. Let me
>>> know if that works, and what is needed from me to get this
>>> going.
>>>
>>> Shyam
>>>
>>> [1] Glusto 3.12 run:
>>> https://ci.centos.org/view/Gluster/job/gluster_glusto/365/
>>>
>>> [2] Glusto on master:
>>>
>>> https://ci.centos.org/view/Gluster/job/gluster_glusto/360/testReport/functional.bvt.test_cvt/
>>>
>>>
>>> [3] Failed test case:
>>>
>>> https://ci.centos.org/view/Gluster/job/gluster_glusto/365/testReport/functional.bvt.test_cvt/TestGlusterHealSanity_dispersed_glusterfs/test_self_heal_when_io_in_progress/
>>>
>>>
>>> [4] Log analysis pointing to the failed check:
>>> https://paste.fedoraproject.org/paste/znTPiFLrc2~vsWuoYRToZA
>>>
>>> "Releases are made better together"
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>>
>>>
> --
- Atin (atinm)