[Gluster-devel] Race in protocol/client and RPC
Xavi Hernandez
jahernan at redhat.com
Thu Feb 1 14:55:27 UTC 2018
On Thu, Feb 1, 2018 at 2:48 PM, Shyam Ranganathan <srangana at redhat.com>
wrote:
> On 02/01/2018 08:25 AM, Xavi Hernandez wrote:
> > After having tried several things, it seems that solving these races
> > will be complex. All my attempts to fix them have caused failures in
> > other connections. Since I have other work to do and this doesn't seem
> > to be causing serious failures in production, I'll leave it for now and
> > come back to it when I have more time.
>
> Xavi, could you convert the findings into a bug and post the details
> there, so that they can be followed up? (if not already done)
>
I've just created this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1541032
> >
> > Xavi
> >
> > On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez <jahernan at redhat.com>
> > wrote:
> >
> > Hi all,
> >
> > I've identified a race in the RPC layer that causes some spurious
> > disconnections and CHILD_DOWN notifications.
> >
> > The problem happens when protocol/client reconfigures a connection
> > to move from glusterd to glusterfsd. This is done by calling
> > rpc_clnt_reconfig() followed by rpc_transport_disconnect().
> >
> > This seems fine because client_rpc_notify() will call
> > rpc_clnt_cleanup_and_start() when the disconnect notification is
> > received. However, there's a problem.
> >
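> > To make it concrete, the sequence looks roughly like this (a simplified
> > sketch only; the surrounding function, struct fields and exact argument
> > lists are approximate, only rpc_clnt_reconfig() and
> > rpc_transport_disconnect() are the real entry points):
> >
> >     /* protocol/client (sketch): switch the connection from the
> >      * glusterd port to the brick (glusterfsd) port. */
> >     static void
> >     client_switch_to_brick(struct rpc_clnt *rpc, int brick_port)
> >     {
> >         struct rpc_clnt_config config = { 0, };
> >
> >         config.remote_port = brick_port;
> >
> >         /* 1. Store the new destination in the rpc_clnt. */
> >         rpc_clnt_reconfig(rpc, &config);
> >
> >         /* 2. Drop the current (glusterd) connection. The DISCONNECT
> >          * notification makes client_rpc_notify() call
> >          * rpc_clnt_cleanup_and_start(), which should restart the
> >          * connection towards glusterfsd. */
> >         rpc_transport_disconnect(rpc->conn.trans, _gf_false);
> >     }
> >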
> > Suppose that the disconnect notification has been executed and we are
> > just about to call rpc_clnt_cleanup_and_start(). If the reconnection
> > timer fires at this point, rpc_clnt_reconnect() will be processed. This
> > causes the socket to be reconnected and a connection notification to be
> > processed. Then a handshake request is sent to the server.
> >
> > However, when rpc_clnt_cleanup_and_start() finally runs, all pending
> > XIDs are deleted. When the reply to the handshake arrives, we are unable
> > to map its XID, so the request fails. The handshake therefore fails and
> > the client is considered down, sending a CHILD_DOWN notification to the
> > upper xlators.
> >
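> > The effect can be modelled with a tiny standalone program (this is not
> > GlusterFS code, just an abstract model of the interleaving; the XID
> > table stands for the saved frames of the connection):
> >
> >     #include <stdbool.h>
> >     #include <stdio.h>
> >
> >     #define MAX_XID 16
> >
> >     /* Models the table of in-flight requests, keyed by XID. */
> >     static bool pending[MAX_XID];
> >
> >     /* Models the reconnect path submitting the handshake request. */
> >     static int send_handshake(int xid)
> >     {
> >         pending[xid] = true;
> >         printf("handshake sent, xid=%d\n", xid);
> >         return xid;
> >     }
> >
> >     /* Models rpc_clnt_cleanup_and_start() forgetting every sent XID. */
> >     static void cleanup_pending(void)
> >     {
> >         for (int i = 0; i < MAX_XID; i++)
> >             pending[i] = false;
> >         printf("cleanup: all pending XIDs dropped\n");
> >     }
> >
> >     /* Models the reply arriving from the server. */
> >     static void receive_reply(int xid)
> >     {
> >         if (pending[xid])
> >             printf("reply xid=%d matched, handshake succeeds\n", xid);
> >         else
> >             printf("reply xid=%d unknown -> handshake fails, CHILD_DOWN\n",
> >                    xid);
> >     }
> >
> >     int main(void)
> >     {
> >         /* The disconnect notification has already run, but the delayed
> >          * cleanup has not been executed yet when the timer fires: */
> >         int xid = send_handshake(1); /* rpc_clnt_reconnect() path */
> >         cleanup_pending();           /* delayed cleanup runs late */
> >         receive_reply(xid);          /* reply can't be matched    */
> >         return 0;
> >     }
> >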
> > In some tests this causes processing to start while a brick is
> > unexpectedly down, leading to spurious test failures.
> >
> > To solve the problem I've made rpc_clnt_reconfig() disable the RPC
> > connection, using code similar to rpc_clnt_disable(). This prevents the
> > background rpc_clnt_reconnect() timer from being executed, avoiding the
> > problem.
> >
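> > Roughly, the change looks like this (a sketch only; the field and
> > helper names below are my approximation of what rpc_clnt_disable()
> > does, the real code is a bit more involved):
> >
> >     /* At the end of rpc_clnt_reconfig() (sketch): make sure the
> >      * reconnect timer cannot fire between the disconnect
> >      * notification and rpc_clnt_cleanup_and_start(). */
> >     pthread_mutex_lock(&rpc->conn.lock);
> >     {
> >         rpc->disabled = 1;              /* flag rpc_clnt_disable() sets */
> >         if (rpc->conn.reconnect) {
> >             gf_timer_call_cancel(rpc->ctx, rpc->conn.reconnect);
> >             rpc->conn.reconnect = NULL; /* no pending rpc_clnt_reconnect() */
> >         }
> >     }
> >     pthread_mutex_unlock(&rpc->conn.lock);
> >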
> > This seems to work fine for many tests, but it appears to cause some
> > issues in gfapi-based tests. I'm still investigating this.
> >
> > Xavi
> >
>