<div dir="ltr"><div><div><div>Volume stop in brick-mux mode reveals a race with my patch [1].<br></div>Although this behavior is 100% reproducible with my patch, that by no means implies the patch itself is buggy.<br><br></div>In
brick-mux mode, during volume stop, when glusterd sends a brick-detach
message to the brick process for the last brick, the brick process
responds to glusterd with an acknowledgment and then kills itself
with SIGTERM. All of this sounds fine. However, the response sometimes
fails to reach glusterd: a socket disconnect notification arrives at
glusterd before the response does. This causes glusterd to presume that
something has gone wrong during the volume stop, and glusterd
then fails the volume stop operation, causing the test to fail.<br><br></div>This race is reproducible by running the test tests/basic/distribute/rebal-all-nodes-migrate.t in brick-mux mode against my patch [1].<br><div><br><div><div>[1] <a href="https://review.gluster.org/19308">https://review.gluster.org/19308</a></div><div><br></div></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Feb 1, 2018 at 9:54 AM, Atin Mukherjee <span dir="ltr"><<a href="mailto:amukherj@redhat.com" target="_blank">amukherj@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I don't think that's the right way. Ideally, the test shouldn't attempt to stop a volume while a rebalance session is in progress. If we do see such a situation even after checking the rebalance status and waiting up to 30 secs for it to finish, and volume stop still fails with a rebalance-session-in-progress error, that means either (a) the rebalance session took longer than the timeout passed to EXPECT_WITHIN or (b) there's a bug in the code.<br></div><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">On Thu, Feb 1, 2018 at 9:46 AM, Milind Changire <span dir="ltr"><<a href="mailto:mchangir@redhat.com" target="_blank">mchangir@redhat.com</a>></span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"><div dir="ltr"><div>If a *volume stop* fails at a user's production site with a reason like *rebalance session is active*, the admin will wait for the session to complete and then reissue *volume stop*.<br><br>So, in essence, the failed volume stop is not fatal. For the regression tests, I would like to propose changing a single volume stop to *EXPECT_WITHIN 30*, so that if a volume cannot be stopped even after 30 seconds, it could be termed fatal in the regression 
scenario.<br clear="all"><br></div>Any comments about the proposal?<span class="m_-592390432689262335HOEnZb"><font color="#888888"><br><br><div>-- <br><div class="m_-592390432689262335m_-3520425827621538936gmail_signature"><div dir="ltr"><div><div dir="ltr">Milind<br><br></div></div></div></div>
</div></font></span></div>
<br></div></div>______________________________<wbr>_________________<br>
Gluster-devel mailing list<br>
<a href="mailto:Gluster-devel@gluster.org" target="_blank">Gluster-devel@gluster.org</a><br>
<a href="http://lists.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://lists.gluster.org/mailm<wbr>an/listinfo/gluster-devel</a><br></blockquote></div><br></div>
</blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">Milind<br><br></div></div></div></div>
</div>
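To make the *EXPECT_WITHIN 30* idea discussed in the quoted proposal concrete, here is a minimal, self-contained sketch. The EXPECT_WITHIN function below is a simplified re-implementation of the helper from the test framework's tests/include.rc (not the exact framework code), and fake_rebalance_status is a hypothetical stand-in for a helper that would parse `gluster volume rebalance <vol> status`; all names and timings here are illustrative.

```shell
#!/bin/bash
# Simplified re-implementation of the Gluster test framework's
# EXPECT_WITHIN (the real helper lives in tests/include.rc):
# re-run a probe command until its output matches, or time out.
EXPECT_WITHIN () {
    local timeout=$1 expected=$2
    shift 2
    local deadline=$((SECONDS + timeout))
    while [ "$SECONDS" -lt "$deadline" ]; do
        if [ "$("$@")" = "$expected" ]; then
            return 0
        fi
        sleep 1
    done
    echo "EXPECT_WITHIN: '$*' did not produce '$expected' in ${timeout}s" >&2
    return 1
}

# Hypothetical stand-in for a helper that would parse
# 'gluster volume rebalance <vol> status'. This stub "completes" on its
# third poll, mimicking a rebalance that finishes while we wait.
# (State goes through a temp file because the probe runs in a subshell.)
statefile=$(mktemp)
echo 0 > "$statefile"
fake_rebalance_status () {
    local n=$(( $(cat "$statefile") + 1 ))
    echo "$n" > "$statefile"
    if [ "$n" -ge 3 ]; then echo "completed"; else echo "in progress"; fi
}

# Wait up to 30 seconds for the rebalance session to finish before
# attempting the volume stop, instead of failing the test on the first
# 'rebalance session is active' error.
if EXPECT_WITHIN 30 "completed" fake_rebalance_status; then
    echo "rebalance finished; safe to issue volume stop"
    # In a real .t, the volume stop command would run here.
fi
```

In an actual regression test the guard above would sit directly in front of the volume stop step, so the stop only becomes a fatal failure once the 30-second window has elapsed with rebalance still active.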