<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Feb 1, 2018 at 2:48 PM, Shyam Ranganathan <span dir="ltr"><<a href="mailto:srangana@redhat.com" target="_blank">srangana@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">On 02/01/2018 08:25 AM, Xavi Hernandez wrote:<br>
> After having tried several things, it seems that it will be complex to<br>
> solve these races. All attempts to fix them have caused failures in<br>
> other connections. Since I have other work to do and this doesn't seem to<br>
> be causing serious failures in production, I'll leave it for now and<br>
> come back to it when I have more time.<br>
<br>
</span>Xavi, convert the findings into a bug, and post the details there, so<br>
that it may be followed up? (if not already done)<br></blockquote><div><br></div><div>I've just created this bug: <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1541032">https://bugzilla.redhat.com/show_bug.cgi?id=1541032</a></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-im gmail-HOEnZb"><br>
><br>
> Xavi<br>
><br>
> On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez <<a href="mailto:jahernan@redhat.com">jahernan@redhat.com</a><br>
</span><div class="gmail-HOEnZb"><div class="gmail-h5">> <mailto:<a href="mailto:jahernan@redhat.com">jahernan@redhat.com</a>>> wrote:<br>
><br>
> Hi all,<br>
><br>
> I've identified a race in the RPC layer that caused some spurious<br>
> disconnections and CHILD_DOWN notifications.<br>
><br>
> The problem happens when protocol/client reconfigures a connection<br>
> to move from glusterd to glusterfsd. This is done by calling<br>
> rpc_clnt_reconfig() followed by rpc_transport_disconnect().<br>
><br>
> This seems fine because client_rpc_notify() will call<br>
> rpc_clnt_cleanup_and_start() when the disconnect notification is<br>
> received. However, there's a problem.<br>
><br>
> Suppose that the disconnection notification has been executed and we<br>
> are just about to call rpc_clnt_cleanup_and_start(). If at this<br>
> point the reconnection timer is fired, rpc_clnt_reconnect() will be<br>
> processed. This will cause the socket to be reconnected and a<br>
> connection notification will be processed. Then a handshake request<br>
> will be sent to the server.<br>
><br>
> However, when rpc_clnt_cleanup_and_start() continues, all sent XIDs<br>
> are deleted. When we receive the answer to the handshake, we are<br>
> unable to map the XID, so the request fails. The handshake therefore<br>
> fails and the client is considered down, sending a CHILD_DOWN<br>
> notification to upper xlators.<br>
><br>
> In some tests, this causes processing to start while a brick is<br>
> unexpectedly down, producing spurious test failures.<br>
><br>
> To solve the problem I've forced the rpc_clnt_reconfig() function to<br>
> disable the RPC connection using code similar to rpc_clnt_disable().<br>
> This prevents the background rpc_clnt_reconnect() timer from being<br>
> executed, avoiding the problem.<br>
><br>
> This seems to work fine for many tests, but it seems to be causing<br>
> some issues in gfapi-based tests. I'm still investigating this.<br>
><br>
> Xavi<br>
><br>
><br>
><br>
><br>
</div></div><div class="gmail-HOEnZb"><div class="gmail-h5">
</div></div></blockquote></div><br></div></div>
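The interleaving described in the quoted report can be replayed deterministically with a small toy model. This is only a sketch under stated assumptions: the `toy_*` names, the fixed-size XID table, and the `disabled` flag are hypothetical stand-ins, not the actual GlusterFS RPC structures or functions. It also assumes that the cleanup step re-enables the connection. The model shows why a handshake sent before the cleanup loses its saved XID, and why suppressing the reconnect timer until the cleanup has run avoids that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SAVED 8

/* Hypothetical, simplified stand-in for an RPC client connection. */
struct toy_conn {
    bool connected;            /* socket state */
    bool disabled;             /* when true, the reconnect timer does nothing */
    uint32_t next_xid;         /* XID to assign to the next request */
    uint32_t saved[MAX_SAVED]; /* XIDs of requests still awaiting a reply */
    size_t n_saved;
};

/* Send a request: remember its XID so the reply can be matched later. */
static uint32_t toy_submit(struct toy_conn *c)
{
    assert(c->connected && c->n_saved < MAX_SAVED);
    uint32_t xid = c->next_xid++;
    c->saved[c->n_saved++] = xid;
    return xid;
}

/* Match a reply against the saved XIDs; false means it cannot be mapped. */
static bool toy_reply(struct toy_conn *c, uint32_t xid)
{
    for (size_t i = 0; i < c->n_saved; i++) {
        if (c->saved[i] == xid) {
            c->saved[i] = c->saved[--c->n_saved];
            return true;
        }
    }
    return false;
}

/* Reconnect timer: reconnects unless disabled or already connected. */
static bool toy_reconnect_timer(struct toy_conn *c)
{
    if (c->disabled || c->connected)
        return false;
    c->connected = true;
    return true;
}

/* Cleanup after a disconnect: forget all pending XIDs, then re-enable
 * the connection (the re-enable is an assumption of this toy model). */
static void toy_cleanup_and_start(struct toy_conn *c)
{
    c->n_saved = 0;
    c->disabled = false;
}

/* Replay the interleaving; returns true if the handshake reply is matched. */
static bool toy_handshake_survives(bool disable_on_reconfig)
{
    struct toy_conn c = { .connected = true, .next_xid = 1 };
    uint32_t xid = 0;
    bool sent = false;

    /* Reconfigure and disconnect; optionally disable the connection. */
    c.disabled = disable_on_reconfig;
    c.connected = false;

    /* The reconnect timer fires before the cleanup has run. */
    if (toy_reconnect_timer(&c)) {
        xid = toy_submit(&c); /* handshake sent too early */
        sent = true;
    }

    /* The disconnect handling now continues and wipes the XID table. */
    toy_cleanup_and_start(&c);

    /* If the timer was suppressed, the reconnect happens only now. */
    if (!sent && toy_reconnect_timer(&c)) {
        xid = toy_submit(&c);
        sent = true;
    }

    return sent && toy_reply(&c, xid);
}
```

Calling `toy_handshake_survives(false)` replays the race: the handshake XID is wiped by the cleanup, so the reply cannot be mapped. Calling `toy_handshake_survives(true)` models the proposed guard: the timer is a no-op until the cleanup has finished, so the handshake is sent afterwards and its reply is matched. The model says nothing about the gfapi failures mentioned in the report.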