[Gluster-devel] Release 3.12: Glusto run status
Shyam Ranganathan
srangana at redhat.com
Tue Aug 29 13:34:41 UTC 2017
On 08/29/2017 09:31 AM, Atin Mukherjee wrote:
>
>
> On Tue, Aug 29, 2017 at 4:13 AM, Shyam Ranganathan
> <srangana at redhat.com> wrote:
>
> Nigel, Shwetha,
>
> The latest Glusto run [a], started by Nigel after fixing the
> prior timeout issue, failed again (though much later in the run).
>
> I took a look at the logs and my analysis is here [b].
>
> @atin, @kaushal, @ppai can you take a look and see if the analysis
> is correct?
>
>
> I took a look at the logs and here is my theory:
>
> glusterd starts the rebalance daemon through the runner framework in
> nowait mode, which means that even though glusterd reports success
> back to the CLI for rebalance start, one of the nodes might take some
> additional time to start the rebalance process and establish the rpc
> connection. In this case we hit a race: while one of the nodes was
> still trying to start the rebalance process, a rebalance status
> command was triggered, which failed on that node because the rpc
> connection was not yet up, and the originator glusterd's commit op
> failed with the "Received commit RJT from uuid:
> 6f9524e6-9f9e-44aa-b2f4-393404adfd9d" error. To avoid these spurious
> timeout issues we normally check the status in a loop until a certain
> timeout. Isn't that the case in glusto? If my analysis is correct, you
> shouldn't see this failure on a 2nd attempt, as it is a race.
Thanks Atin.
In this case there is no second check or timed check (no sleep, nor an
EXPECT_WITHIN-like construct).
@Shwetha, can we fix up this test and give it another go?
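
(For illustration only: a minimal sketch of the kind of timed check being
suggested, written against plain subprocess calls rather than the actual
Glusto helpers; the helper name, timeouts and volume name are placeholders.
The idea is to retry the rebalance status until it succeeds or a deadline
passes, instead of issuing it once right after rebalance start; the same
loop would also serve as the "further timeout" asked for below.)

    import subprocess
    import time

    def wait_for_rebalance_status(volname, timeout=120, interval=5):
        """Retry 'gluster volume rebalance <vol> status' until it
        succeeds or 'timeout' seconds elapse; returns True on success."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            ret = subprocess.call(
                ["gluster", "volume", "rebalance", volname, "status"])
            if ret == 0:
                return True
            # The status op can be rejected (commit RJT) while a peer is
            # still starting its rebalance daemon; back off and retry
            # instead of failing the test on the first attempt.
            time.sleep(interval)
        return False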
>
>
> In short, glusterd got an error when checking the rebalance status
> on one of the nodes:
> "Received commit RJT from uuid: 6f9524e6-9f9e-44aa-b2f4-393404adfd9d"
>
> and the rebalance daemon on the node with that UUID was not yet
> ready to serve requests when this was called, hence I am assuming
> this is what caused the error. But it needs a once-over by one of
> you folks.
>
> @Shwetha, can we add a further timeout between rebalance start and
> checking the status, just so that we avoid this timing issue on
> these nodes.
>
> Thanks,
> Shyam
>
> [a] glusto run:
> https://ci.centos.org/view/Gluster/job/gluster_glusto/377/
>
> [b] analysis of the failure:
> https://paste.fedoraproject.org/paste/mk6ynJ0B9AH6H9ncbyru5w
>
> On 08/25/2017 04:29 PM, Shyam Ranganathan wrote:
>
> Nigel was kind enough to kick off a glusto run on 3.12 head a
> couple of days back. The status can be seen here [1].
>
> The run failed, but it got further than the Glusto run on
> master does (see [2]). Not that this is a consolation; just
> stating the fact.
>
> The run [1] failed at,
> 17:05:57
> functional/bvt/test_cvt.py::TestGlusterHealSanity_dispersed_glusterfs::test_self_heal_when_io_in_progress
> FAILED
>
> The test case failed due to,
> 17:10:28 E AssertionError: ('Volume %s : All process are
> not online', 'testvol_dispersed')
>
> The test case can be seen here [3], and the reason for the failure
> is that Glusto did not wait long enough for the down brick to
> come up (it waited for 10 seconds, but the brick came up after
> about 12 seconds, within the same second as the check for it being
> up). The log snippets pointing to this problem are here [4]. In
> short, no real bug or issue has been found to have caused the
> failure as yet.
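
(Again for illustration only: a minimal sketch of a wait-until-online
check that would replace the fixed 10 second wait, assuming the --xml
output of 'gluster volume status' where each process entry carries a
status field of 1 when online; the helper name and timeouts are
placeholders, not Glusto's own library calls. The same loop would cover
the timeout increase requested of Nigel further down.)

    import subprocess
    import time
    import xml.etree.ElementTree as ET

    def wait_for_processes_online(volname, timeout=60, interval=2):
        """Poll 'gluster volume status <vol> --xml' until every listed
        process (bricks, self-heal daemons, ...) reports status 1, or
        'timeout' seconds elapse; returns True on success."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            proc = subprocess.run(
                ["gluster", "volume", "status", volname, "--xml"],
                capture_output=True, text=True)
            if proc.returncode == 0:
                root = ET.fromstring(proc.stdout)
                statuses = [n.findtext("status") for n in root.iter("node")]
                if statuses and all(s == "1" for s in statuses):
                    return True
            time.sleep(interval)
        return False

With a loop like this, a brick that takes ~12 seconds to come back simply
extends the wait instead of tripping the assertion.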
>
> Glusto as a gating factor for this release was desirable, but
> having got this far on 3.12 does help.
>
> @nigel, we could increase the timeout between bringing the brick
> up and checking whether it is up, and then try another run. Let
> me know if that works, and what is needed from me to get this
> going.
>
> Shyam
>
> [1] Glusto 3.12 run:
> https://ci.centos.org/view/Gluster/job/gluster_glusto/365/
>
> [2] Glusto on master:
> https://ci.centos.org/view/Gluster/job/gluster_glusto/360/testReport/functional.bvt.test_cvt/
>
>
> [3] Failed test case:
> https://ci.centos.org/view/Gluster/job/gluster_glusto/365/testReport/functional.bvt.test_cvt/TestGlusterHealSanity_dispersed_glusterfs/test_self_heal_when_io_in_progress/
>
>
> [4] Log analysis pointing to the failed check:
> https://paste.fedoraproject.org/paste/znTPiFLrc2~vsWuoYRToZA
>
> "Releases are made better together"
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>