[Gluster-devel] Race in protocol/client and RPC

Xavi Hernandez jahernan at redhat.com
Thu Feb 1 13:25:03 UTC 2018


After trying several approaches, it seems that these races will be complex
to solve. Every attempt to fix them has caused failures in other
connections. Since I have other work to do and this doesn't seem to be
causing serious failures in production, I'll leave it for now and come
back to it when I have more time.

Xavi

On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez <jahernan at redhat.com>
wrote:

> Hi all,
>
> I've identified a race in the RPC layer that causes some spurious
> disconnections and CHILD_DOWN notifications.
>
> The problem happens when protocol/client reconfigures a connection to move
> from glusterd to glusterfsd. This is done by calling rpc_clnt_reconfig()
> followed by rpc_transport_disconnect().
>
> This seems fine because client_rpc_notify() will call
> rpc_clnt_cleanup_and_start() when the disconnect notification is received.
> However, there's a problem.
>
> Suppose that the disconnect notification has been processed and we are
> just about to call rpc_clnt_cleanup_and_start(). If the reconnection timer
> fires at this point, rpc_clnt_reconnect() will run. This causes the socket
> to be reconnected and a connection notification to be processed. Then a
> handshake request is sent to the server.
>
> However, when rpc_clnt_cleanup_and_start() continues, all sent XIDs are
> deleted. When we receive the answer to the handshake, we are unable to
> map its XID, so the request fails. The handshake therefore fails and the
> client is considered down, which sends a CHILD_DOWN notification to the
> upper xlators.
>
> As a result, some tests start processing operations while a brick is
> unexpectedly down, causing spurious test failures.
>
> To solve the problem I've forced the rpc_clnt_reconfig() function to
> disable the RPC connection using code similar to rpc_clnt_disable(). This
> prevents the background rpc_clnt_reconnect() timer from being executed,
> avoiding the problem.
>
> This seems to work fine for many tests, but it seems to be causing some
> issues in gfapi-based tests. I'm still investigating this.
>
> Xavi
>
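
For readers following the thread, the interleaving described in the quoted
message can be sketched as a single-threaded toy model. This is only an
illustration under stated assumptions: struct toy_conn and every toy_*
function are hypothetical names, not the real rpc_clnt API, and the model
assumes that the cleanup step re-enables the connection. It replays the
events in the problematic order, with and without disabling the connection
at reconfigure time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_PENDING 8

/* Toy stand-in for an RPC client connection (hypothetical, not rpc_clnt). */
struct toy_conn {
    uint32_t pending[MAX_PENDING]; /* XIDs of requests awaiting a reply */
    int      npending;
    uint32_t next_xid;
    bool     connected;
    bool     disabled;             /* when set, the reconnect timer does nothing */
};

static void toy_init(struct toy_conn *c)
{
    memset(c, 0, sizeof(*c));
    c->next_xid = 1;
}

/* Send a request and remember its XID. Returns the XID, or 0 on failure. */
static uint32_t toy_submit(struct toy_conn *c)
{
    if (!c->connected || c->npending == MAX_PENDING)
        return 0;
    uint32_t xid = c->next_xid++;
    c->pending[c->npending++] = xid;
    return xid;
}

/* Reconnect timer callback: reconnects and sends the handshake request.
 * Returns the handshake XID, or 0 if the connection is disabled. */
static uint32_t toy_reconnect_timer(struct toy_conn *c)
{
    if (c->disabled)
        return 0;
    c->connected = true;
    return toy_submit(c);
}

/* Cleanup step: forget every pending XID, then allow reconnects again
 * (the re-enable is an assumption of this model). */
static void toy_cleanup_and_start(struct toy_conn *c)
{
    c->npending = 0;
    c->disabled = false;
}

/* A reply arrived: true if its XID maps to a pending request. */
static bool toy_handle_reply(struct toy_conn *c, uint32_t xid)
{
    for (int i = 0; i < c->npending; i++) {
        if (c->pending[i] == xid) {
            c->pending[i] = c->pending[--c->npending];
            return true;
        }
    }
    return false;
}

/* Replays the interleaving; returns true if the handshake reply is matched. */
static bool toy_run(bool disable_on_reconfig)
{
    struct toy_conn c;
    toy_init(&c);

    /* Reconfigure, then disconnect. */
    c.disabled = disable_on_reconfig;
    c.connected = false;

    /* The timer fires after the disconnect notification but before cleanup. */
    uint32_t xid = toy_reconnect_timer(&c);
    toy_cleanup_and_start(&c);

    /* If the timer was suppressed, the reconnect happens after cleanup. */
    if (xid == 0)
        xid = toy_reconnect_timer(&c);

    assert(xid != 0);
    return toy_handle_reply(&c, xid);
}
```

In this model, toy_run(false) loses the handshake reply because cleanup
discards its XID, while toy_run(true) matches it because the handshake is
only sent after cleanup. The model says nothing about why the same change
upsets the gfapi-based tests.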
