[Gluster-devel] Release 3.12: Glusto run status
Shyam Ranganathan
srangana at redhat.com
Mon Aug 28 22:43:03 UTC 2017
Nigel, Shwetha,
The latest Glusto run [a], which Nigel started after fixing the prior
timeout issue, failed again (much later into the run this time).
I took a look at the logs; my analysis is here [b].
@atin, @kaushal, @ppai, can you take a look and see whether the
analysis is correct?
In short, glusterd got an error when checking rebalance stats from one
of the nodes:
"Received commit RJT from uuid: 6f9524e6-9f9e-44aa-b2f4-393404adfd9d"
The rebalance daemon on the node with that UUID was not yet ready to
serve requests when the check was made, hence I am assuming this is
what caused the error. But it needs a once-over from one of you folks.
@Shwetha, can we add a further wait between starting the rebalance and
checking its status, just so that we avoid this timing issue on these
nodes?
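For illustration, something along these lines might do (a minimal
sketch; wait_for_rebalance_status is a hypothetical helper that polls
the gluster CLI directly, not part of the current glustolibs API):

import subprocess
import time

def wait_for_rebalance_status(volname, timeout=60, interval=5):
    """Poll 'gluster volume rebalance <vol> status' until the command
    succeeds, or the timeout expires. Hypothetical helper; the real
    change would live in the glustolibs rebalance library."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        ret = subprocess.run(
            ["gluster", "volume", "rebalance", volname, "status"],
            capture_output=True, text=True)
        # A commit reject (RJT) from a node whose rebalance daemon is
        # not yet up surfaces as a failed CLI call; retry instead of
        # asserting immediately.
        if ret.returncode == 0:
            return True
        time.sleep(interval)
    return False

The test would call this between rebalance start and the status
assertions, so a slow daemon costs a few retries rather than a failed
run.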
Thanks,
Shyam
[a] glusto run: https://ci.centos.org/view/Gluster/job/gluster_glusto/377/
[b] analysis of the failure:
https://paste.fedoraproject.org/paste/mk6ynJ0B9AH6H9ncbyru5w
On 08/25/2017 04:29 PM, Shyam Ranganathan wrote:
> Nigel was kind enough to kick off a Glusto run on the 3.12 head a
> couple of days back. The status can be seen here [1].
>
> The run failed, but managed to get past what Glusto does on master (see
> [2]). Not that this is a consolation, but just stating the fact.
>
> The run [1] failed at,
> 17:05:57
> functional/bvt/test_cvt.py::TestGlusterHealSanity_dispersed_glusterfs::test_self_heal_when_io_in_progress
> FAILED
>
> The test case failed due to,
> 17:10:28 E AssertionError: ('Volume %s : All process are not
> online', 'testvol_dispersed')
>
> The test case can be seen here [3], and the reason for failure is that
> Glusto did not wait long enough for the down brick to come up (it
> waited for 10 seconds, but the brick came up only after about 12
> seconds, in the same second as the check for it being up). The log
> snippets pointing to this problem are here [4]. In short, no real bug
> or issue has been found to have caused the failure so far.
>
> Glusto as a gating factor for this release was desirable, but even
> getting this far on 3.12 does help.
>
> @nigel, we could try another run after increasing the timeout between
> bringing the brick up and checking whether it is up. Let me know if
> that works, and what is needed from me to get this going.
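If we go with a poll here rather than a larger fixed sleep, a rough
sketch (wait_for_brick_online is a hypothetical helper, not the
current glustolibs code):

import subprocess
import time

def wait_for_brick_online(volname, brick, timeout=60, interval=2):
    """Poll 'gluster volume status <vol>' until the given brick
    (host:/path) shows 'Y' in the Online column, or the timeout
    expires. Crude text match on the status table; a robust version
    would parse 'gluster volume status <vol> --xml' instead."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        ret = subprocess.run(["gluster", "volume", "status", volname],
                             capture_output=True, text=True)
        for line in ret.stdout.splitlines():
            # Rows look like: "Brick host:/path  49152  0  Y  <pid>"
            if brick in line and " Y " in line:
                return True
        time.sleep(interval)
    return False

With the 10-vs-12 second window above, a bounded poll removes the race
without padding every run with a fixed delay.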
>
> Shyam
>
> [1] Glusto 3.12 run:
> https://ci.centos.org/view/Gluster/job/gluster_glusto/365/
>
> [2] Glusto on master:
> https://ci.centos.org/view/Gluster/job/gluster_glusto/360/testReport/functional.bvt.test_cvt/
>
>
> [3] Failed test case:
> https://ci.centos.org/view/Gluster/job/gluster_glusto/365/testReport/functional.bvt.test_cvt/TestGlusterHealSanity_dispersed_glusterfs/test_self_heal_when_io_in_progress/
>
>
> [4] Log analysis pointing to the failed check:
> https://paste.fedoraproject.org/paste/znTPiFLrc2~vsWuoYRToZA
>
> "Releases are made better together"
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel