[Gluster-devel] Release 3.12: Glusto run status

Shyam Ranganathan srangana at redhat.com
Tue Aug 29 13:34:41 UTC 2017


On 08/29/2017 09:31 AM, Atin Mukherjee wrote:
> 
> 
> On Tue, Aug 29, 2017 at 4:13 AM, Shyam Ranganathan
> <srangana at redhat.com> wrote:
> 
>     Nigel, Shwetha,
> 
>     The latest Glusto run [a], started by Nigel after fixing the
>     prior timeout issue, failed again (though much later in the run).
> 
>     I took a look at the logs; my analysis is here [b].
> 
>     @atin, @kaushal, @ppai can you take a look and see if the analysis
>     is correct?
> 
> 
> I took a look at the logs and here is my theory:
> 
> glusterd starts the rebalance daemon through the runner framework in 
> nowait mode, which essentially means that even though glusterd reports 
> success back to the CLI for rebalance start, one of the nodes might 
> take some additional time to start the rebalance process and establish 
> the rpc connection. In this case we hit a race: while one of the nodes 
> was still trying to start the rebalance process, a rebalance status 
> command was triggered, which failed on that node because the rpc 
> connection wasn't yet established, so the originator glusterd's commit 
> op failed with "Received commit RJT from uuid: 
> 6f9524e6-9f9e-44aa-b2f4-393404adfd9d". To avoid such spurious timeout 
> issues we usually check the status in a loop until a certain timeout. 
> Isn't that the case in Glusto? If my analysis is correct, you shouldn't 
> see this failure on a second attempt, as it is a race.

Thanks Atin.

In this case there is no second check or timed check (no sleep, nor an 
EXPECT_WITHIN-like construct; see the sketch below).
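
For reference, a timed check in Glusto's Python could be a small 
generic helper, mirroring the EXPECT_WITHIN construct from the 
regression test suite (a sketch; the name and signature are 
illustrative, not an existing Glusto API):

    import time

    def expect_within(timeout, interval, check, *args, **kwargs):
        # Retry check(*args, **kwargs) until it returns truthy or the
        # timeout (in seconds) expires; EXPECT_WITHIN-style polling.
        deadline = time.time() + timeout
        while True:
            if check(*args, **kwargs):
                return True
            if time.time() >= deadline:
                return False
            time.sleep(interval)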

@Shwetha, can we fix up this test and give it another go?
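
For the rebalance case, the check handed to such a helper could simply 
be the status command succeeding on all nodes, which sidesteps the race 
described above (again a sketch calling the gluster CLI directly; the 
volume name is illustrative):

    import subprocess

    def rebalance_status_ok(volname):
        # Succeeds only once every node's rebalance daemon is up and
        # answering, i.e. the commit is no longer rejected ("RJT").
        ret = subprocess.run(
            ["gluster", "volume", "rebalance", volname, "status"],
            capture_output=True, text=True)
        return ret.returncode == 0

    # e.g.: expect_within(120, 5, rebalance_status_ok, "testvol")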

> 
> 
>     In short, glusterd got an error when checking rebalance status
>     from one of the nodes:
>     "Received commit RJT from uuid: 6f9524e6-9f9e-44aa-b2f4-393404adfd9d"
> 
>     The rebalance daemon on the node with that UUID was not really
>     ready to serve requests when this was called, hence I am assuming
>     this is what caused the error. But it needs a once-over by one of
>     you folks.
> 
>     @Shwetha, can we add a further timeout between rebalance start and
>     checking the status, just so that we avoid this timing issue on
>     these nodes.
> 
>     Thanks,
>     Shyam
> 
>     [a] glusto run:
>     https://ci.centos.org/view/Gluster/job/gluster_glusto/377/
> 
>     [b] analysis of the failure:
>     https://paste.fedoraproject.org/paste/mk6ynJ0B9AH6H9ncbyru5w
> 
>     On 08/25/2017 04:29 PM, Shyam Ranganathan wrote:
> 
>         Nigel was kind enough to kick off a glusto run on 3.12 head a
>         couple of days back. The status can be seen here [1].
> 
>         The run failed, but it got further than Glusto does on master
>         (see [2]). Not that this is a consolation, just stating the
>         fact.
> 
>         The run [1] failed at,
>         17:05:57
>         functional/bvt/test_cvt.py::TestGlusterHealSanity_dispersed_glusterfs::test_self_heal_when_io_in_progress
>         FAILED
> 
>         The test case failed due to,
>         17:10:28 E       AssertionError: ('Volume %s : All process are
>         not online', 'testvol_dispersed')
> 
>         The test case can be seen here [3]. The reason for the failure
>         is that Glusto did not wait long enough for the down brick to
>         come up: it waited for 10 seconds, but the brick came up after
>         about 12 seconds, i.e. within the same second as the check for
>         it being up. The log snippets pointing to this problem are here
>         [4]. In short, no real bug or issue has caused the failure yet.
> 
>         Glusto as a gating factor for this release was desirable, but
>         having got this far on 3.12 does help.
> 
>         @nigel, we could increase the timeout between bringing the
>         brick up and checking whether it is up, and then try another
>         run; a sketch of such a polled wait follows. Let me know if
>         that works, and what is needed from me to get this going.
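> 
>         A minimal sketch of such a polled wait in Python (the helper
>         names, default timeouts, and the <node>/<status> layout assumed
>         for the CLI's --xml output are illustrative, not Glusto API):
> 
>             import subprocess
>             import time
>             import xml.etree.ElementTree as ET
> 
>             def all_processes_online(volname):
>                 # Parse 'gluster volume status <vol> --xml'; each <node>
>                 # entry is expected to carry a <status> of 1 when the
>                 # process (brick, shd, ...) is online.
>                 out = subprocess.run(
>                     ["gluster", "volume", "status", volname, "--xml"],
>                     capture_output=True, text=True)
>                 if out.returncode != 0:
>                     return False
>                 root = ET.fromstring(out.stdout)
>                 states = [n.findtext("status") for n in root.iter("node")]
>                 return bool(states) and all(s == "1" for s in states)
> 
>             def wait_for_processes_online(volname, timeout=30, interval=2):
>                 # Poll rather than sleep once for a fixed 10s, so a brick
>                 # that takes 12s to come up does not fail the test.
>                 deadline = time.time() + timeout
>                 while time.time() < deadline:
>                     if all_processes_online(volname):
>                         return True
>                     time.sleep(interval)
>                 return False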
> 
>         Shyam
> 
>         [1] Glusto 3.12 run:
>         https://ci.centos.org/view/Gluster/job/gluster_glusto/365/
> 
>         [2] Glusto on master:
>         https://ci.centos.org/view/Gluster/job/gluster_glusto/360/testReport/functional.bvt.test_cvt/
> 
> 
>         [3] Failed test case:
>         https://ci.centos.org/view/Gluster/job/gluster_glusto/365/testReport/functional.bvt.test_cvt/TestGlusterHealSanity_dispersed_glusterfs/test_self_heal_when_io_in_progress/
> 
> 
>         [4] Log analysis pointing to the failed check:
>         https://paste.fedoraproject.org/paste/znTPiFLrc2~vsWuoYRToZA
> 
>         "Releases are made better together"

